Posted to reviews@spark.apache.org by "uros-db (via GitHub)" <gi...@apache.org> on 2024/02/22 09:05:49 UTC

[PR] Collation support for built-in string functions: contains, startswith, endswith [spark]

uros-db opened a new pull request, #45216:
URL: https://github.com/apache/spark/pull/45216

   ### What changes were proposed in this pull request?
   Refactor built-in string functions to support collation for: contains, startsWith, endsWith.
   
   
   ### Why are the changes needed?
   Add collation support for built-in string functions in Spark.
   
   
   ### Does this PR introduce _any_ user-facing change?
   Yes, users should now be able to use COLLATE within arguments for the built-in string functions CONTAINS, STARTSWITH, and ENDSWITH in Spark SQL queries.
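   
   For illustration, a minimal usage sketch (collation names are those referenced in this PR's tests; results shown assume binary collations):
   
   ```scala
   // Sketch only: contains/startswith/endswith now accept collated string arguments.
   // Both arguments must share the same collation; mixing collations fails analysis
   // with the COLLATION_MISMATCH error.
   spark.sql("SELECT contains(collate('abcde', 'UCS_BASIC'), collate('abc', 'UCS_BASIC'))").show()   // true
   spark.sql("SELECT startswith(collate('abcde', 'UNICODE'), collate('ABC', 'UNICODE'))").show()     // false (binary)
   spark.sql("SELECT endswith(collate('abcde', 'UNICODE'), collate('cde', 'UNICODE'))").show()       // true
   ```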
   
   
   ### How was this patch tested?
   Unit tests for:
   - string expressions (StringExpressionsSuite)
   - queries using "collate" (CollationSuite)
   
   
   ### Was this patch authored or co-authored using generative AI tooling?
   Yes.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47131][SQL][COLLATION] String function support: contains, startswith, endswith [spark]

Posted by "uros-db (via GitHub)" <gi...@apache.org>.
uros-db commented on code in PR #45216:
URL: https://github.com/apache/spark/pull/45216#discussion_r1509076371


##########
sql/core/src/test/scala/org/apache/spark/sql/CollationSuite.scala:
##########
@@ -183,6 +183,266 @@ class CollationSuite extends DatasourceV2SQLBase {
     }
   }
 
+  test("checkCollation throws exception for incompatible collationIds") {
+    val left: String = "abc" // collate with 'UNICODE_CI'
+    val leftCollationName: String = "UNICODE_CI";
+    var right: String = null // collate with 'UNICODE'
+    val rightCollationName: String = "UNICODE";
+    // contains
+    right = left.substring(1, 2);
+    checkError(
+      exception = intercept[SparkException] {
+        spark.sql(s"SELECT contains(collate('$left', '$leftCollationName')," +
+          s"collate('$right', '$rightCollationName'))").collect()
+      },
+      errorClass = "COLLATION_MISMATCH",
+      sqlState = "42K09",
+      parameters = Map(
+        "collationNameLeft" -> s"$leftCollationName",
+        "collationNameRight" -> s"$rightCollationName"
+      )
+    )
+    // startsWith
+    right = left.substring(0, 1);
+    checkError(
+      exception = intercept[SparkException] {
+        spark.sql(s"SELECT startsWith(collate('$left', '$leftCollationName')," +
+          s"collate('$right', '$rightCollationName'))").collect()
+      },
+      errorClass = "COLLATION_MISMATCH",
+      sqlState = "42K09",
+      parameters = Map(
+        "collationNameLeft" -> s"$leftCollationName",
+        "collationNameRight" -> s"$rightCollationName"
+      )
+    )
+    // endsWith
+    right = left.substring(2, 3);
+    checkError(
+      exception = intercept[SparkException] {
+        spark.sql(s"SELECT endsWith(collate('$left', '$leftCollationName')," +
+          s"collate('$right', '$rightCollationName'))").collect()
+      },
+      errorClass = "COLLATION_MISMATCH",
+      sqlState = "42K09",
+      parameters = Map(
+        "collationNameLeft" -> s"$leftCollationName",
+        "collationNameRight" -> s"$rightCollationName"
+      )
+    )
+  }
+
+  test("Support contains string expression with Collation") {
+    // Test 'contains' with different collations
+    var listLeft: List[String] = List()
+    var listRight: List[String] = List()
+    var listResult: List[Boolean] = List()
+
+    // UCS_BASIC (default) & UNICODE collation
+    listLeft = List("", "c", "abc", "cde", "abde", "abcde", "C", "ABC", "CDE", "ABDE", "ABCDE")
+    listRight = List("", "c", "abc", "cde", "abde", "abcde", "C", "ABC", "CDE", "ABDE", "ABCDE")
+    listResult = List(
+    //  ""     c     abc    cde   abde  abcde    C     ABC    CDE    ABDE  ABCDE
+      true, false, false, false, false, false, false, false, false, false, false, //  ""
+      true, true, false, false, false, false, false, false, false, false, false,  //   c
+      true, true, true, false, false, false, false, false, false, false, false,   // abc
+      true, true, false, true, false, false, false, false, false, false, false,   //   cde
+      true, false, false, false, true, false, false, false, false, false, false,  // abde
+      true, true, true, true, false, true, false, false, false, false, false,     // abcde
+      true, false, false, false, false, false, true, false, false, false, false,  //   C
+      true, false, false, false, false, false, true, true, false, false, false,   // ABC
+      true, false, false, false, false, false, true, false, true, false, false,   //   CDE
+      true, false, false, false, false, false, false, false, false, true, false,  // ABDE
+      true, false, false, false, false, false, true, true, true, false, true)     // ABCDE

Review Comment:
   while it may seem a bit unusual, I think this matrix approach covers a broad spectrum of test cases and generally works really well for this set of functions - covering various edge cases and different collation types (this was especially useful when debugging and experimenting with new collations), e.g. imagine throwing Serbian (ć, Ć) or German (ä, Ä) collations into the mix with other possible `abc`s
   
   when I first wrote it as a standard linear set of tests, it was much harder to see how and why these functions behave the way they do with different collations, while this nicely aligned matrix gives a pretty clear overview all in one place (in addition, previously it was easy to miss a case and not cover everything, and hard-coding the expected results was extra tedious)
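   
   for illustration, here's a rough (hypothetical) sketch of how such a matrix can be walked - row `i` pairs with `listLeft(i)`, column `j` with `listRight(j)`:
   
   ```scala
   // Hypothetical sketch, not the exact test code: iterate the flattened result matrix
   // and check `contains` for every (left, right) pair under a single collation.
   def checkContainsMatrix(collation: String, listLeft: List[String],
       listRight: List[String], listResult: List[Boolean]): Unit = {
     for (i <- listLeft.indices; j <- listRight.indices) {
       val expected = listResult(i * listRight.length + j)
       checkAnswer(
         spark.sql(s"SELECT contains(collate('${listLeft(i)}', '$collation'), " +
           s"collate('${listRight(j)}', '$collation'))"),
         Row(expected))
     }
   }
   ```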





Re: [PR] [SPARK-47131][SQL][COLLATION] String function support: contains, startswith, endswith [spark]

Posted by "uros-db (via GitHub)" <gi...@apache.org>.
uros-db commented on code in PR #45216:
URL: https://github.com/apache/spark/pull/45216#discussion_r1510723989


##########
sql/core/src/test/scala/org/apache/spark/sql/CollationSuite.scala:
##########
@@ -183,6 +183,266 @@ class CollationSuite extends DatasourceV2SQLBase {
     }
   }
 
+  test("checkCollation throws exception for incompatible collationIds") {
+    val left: String = "abc" // collate with 'UNICODE_CI'
+    val leftCollationName: String = "UNICODE_CI";
+    var right: String = null // collate with 'UNICODE'
+    val rightCollationName: String = "UNICODE";
+    // contains
+    right = left.substring(1, 2);
+    checkError(
+      exception = intercept[SparkException] {
+        spark.sql(s"SELECT contains(collate('$left', '$leftCollationName')," +
+          s"collate('$right', '$rightCollationName'))").collect()

Review Comment:
   added a check in analysis as well, so this can be removed



##########
sql/core/src/test/scala/org/apache/spark/sql/CollationSuite.scala:
##########
@@ -183,6 +183,266 @@ class CollationSuite extends DatasourceV2SQLBase {
     }
   }
 
+  test("checkCollation throws exception for incompatible collationIds") {
+    val left: String = "abc" // collate with 'UNICODE_CI'
+    val leftCollationName: String = "UNICODE_CI";
+    var right: String = null // collate with 'UNICODE'
+    val rightCollationName: String = "UNICODE";
+    // contains
+    right = left.substring(1, 2);
+    checkError(
+      exception = intercept[SparkException] {
+        spark.sql(s"SELECT contains(collate('$left', '$leftCollationName')," +
+          s"collate('$right', '$rightCollationName'))").collect()

Review Comment:
   added a check in analysis as well, so this is finally removed





Re: [PR] [SPARK-47131][SQL][COLLATION] String function support: contains, startswith, endswith [spark]

Posted by "dbatomic (via GitHub)" <gi...@apache.org>.
dbatomic commented on code in PR #45216:
URL: https://github.com/apache/spark/pull/45216#discussion_r1508867241


##########
common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java:
##########
@@ -341,6 +343,21 @@ public boolean contains(final UTF8String substring) {
     return false;
   }
 
+  public boolean contains(final UTF8String substring, int collationId) throws SparkException {
+    if (CollationFactory.fetchCollation(collationId).isBinaryCollation) {
+      return this.contains(substring);
+    }
+    if (collationId == CollationFactory.LOWERCASE_COLLATION_ID) {
+      return this.toLowerCase().contains(substring.toLowerCase());
+    }
+    // TODO: enable ICU collation support for "contains"

Review Comment:
   You can also create a task in JIRA for this and add the task id in a comment.





Re: [PR] [SPARK-47131][SQL] Collations - support for built-in string functions: contains, startswith, endswith [spark]

Posted by "dbatomic (via GitHub)" <gi...@apache.org>.
dbatomic commented on code in PR #45216:
URL: https://github.com/apache/spark/pull/45216#discussion_r1499616708


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala:
##########
@@ -586,9 +584,17 @@ object ContainsExpressionBuilder extends StringBinaryPredicateExpressionBuilderB
 }
 
 case class Contains(left: Expression, right: Expression) extends StringPredicate {
-  override def compare(l: UTF8String, r: UTF8String): Boolean = l.contains(r)
+  override def compare(l: UTF8String, r: UTF8String): Boolean = {
+    val collationID = right.dataType.asInstanceOf[StringType].collationId

Review Comment:
   Yeah, it does make sense to special-case codegen here for DEFAULT_COLLATION_ID.
   
   And also yes, adding a micro-benchmark is on the list of tasks. We can figure out who is going to pick up that part of the work.





Re: [PR] [SPARK-47131][SQL][COLLATION] String function support: contains, startswith, endswith [spark]

Posted by "uros-db (via GitHub)" <gi...@apache.org>.
uros-db commented on code in PR #45216:
URL: https://github.com/apache/spark/pull/45216#discussion_r1503794242


##########
sql/core/src/test/scala/org/apache/spark/sql/CollationSuite.scala:
##########
@@ -174,4 +174,237 @@ class CollationSuite extends QueryTest with SharedSparkSession {
           Row(expected))
     }
   }
+
+  test("checkCollation throws exception for incompatible collationIds") {
+    val left = "abc" // collate with 'UNICODE_CI'
+    val right = "a" // collate with 'UNICODE'
+    var exception: IllegalArgumentException = null
+    // contains
+    exception = intercept[IllegalArgumentException] {

Review Comment:
   Done!





Re: [PR] [SPARK-47131][SQL][COLLATION] String function support: contains, startswith, endswith [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan closed pull request #45216: [SPARK-47131][SQL][COLLATION] String function support: contains, startswith, endswith
URL: https://github.com/apache/spark/pull/45216




Re: [PR] [SPARK-47131][SQL][COLLATION] String function support: contains, startswith, endswith [spark]

Posted by "uros-db (via GitHub)" <gi...@apache.org>.
uros-db commented on code in PR #45216:
URL: https://github.com/apache/spark/pull/45216#discussion_r1502735941


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala:
##########
@@ -586,9 +584,17 @@ object ContainsExpressionBuilder extends StringBinaryPredicateExpressionBuilderB
 }
 
 case class Contains(left: Expression, right: Expression) extends StringPredicate {
-  override def compare(l: UTF8String, r: UTF8String): Boolean = l.contains(r)
+  override def compare(l: UTF8String, r: UTF8String): Boolean = {
+    val collationID = right.dataType.asInstanceOf[StringType].collationId

Review Comment:
   Makes sense! Fixed





Re: [PR] [SPARK-47131][SQL][COLLATION] String function support: contains, startswith, endswith [spark]

Posted by "dbatomic (via GitHub)" <gi...@apache.org>.
dbatomic commented on code in PR #45216:
URL: https://github.com/apache/spark/pull/45216#discussion_r1502774167


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala:
##########
@@ -559,6 +557,19 @@ case class BinaryPredicate(override val prettyName: String, left: Expression, ri
   }
 }
 
+trait CollationCheck {
+  self: BinaryExpression =>

Review Comment:
   I think that you should implement this against `StringPredicate` instead of `BinaryExpression`. If it is just `BinaryExpression` you don't have any guarantee that `asInstanceOf[StringType]` will work.
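   
   For illustration, a rough sketch of what I mean (not the exact code in the PR):
   
   ```scala
   // Illustrative sketch: with a StringPredicate self-type, both children are expected
   // to be StringType after analysis, so the casts below are safe by construction.
   trait CollationCheck {
     self: StringPredicate =>
   
     def checkCollationIds(): Int = {
       val leftCollationId = left.dataType.asInstanceOf[StringType].collationId
       val rightCollationId = right.dataType.asInstanceOf[StringType].collationId
       if (leftCollationId != rightCollationId) {
         throw new IllegalArgumentException(
           s"Collation mismatch: $leftCollationId vs $rightCollationId")
       }
       leftCollationId
     }
   }
   ```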





Re: [PR] [SPARK-47131][SQL][COLLATION] String function support: contains, startswith, endswith [spark]

Posted by "dbatomic (via GitHub)" <gi...@apache.org>.
dbatomic commented on code in PR #45216:
URL: https://github.com/apache/spark/pull/45216#discussion_r1502937584


##########
common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java:
##########
@@ -341,6 +340,15 @@ public boolean contains(final UTF8String substring) {
     return false;
   }
 
+  public boolean contains(final UTF8String substring, int collationId) {
+    if (collationId == CollationFactory.LOWERCASE_COLLATION_ID) {

Review Comment:
   It's fine to have a short circuit and have `stringExpression` directly call `containsBinary`. I just think that `contains` should work for all collations, including `UCS_BASIC`.





Re: [PR] [SPARK-47131][SQL][COLLATION] String function support: contains, startswith, endswith [spark]

Posted by "uros-db (via GitHub)" <gi...@apache.org>.
uros-db commented on code in PR #45216:
URL: https://github.com/apache/spark/pull/45216#discussion_r1505450164


##########
common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java:
##########
@@ -341,6 +342,21 @@ public boolean contains(final UTF8String substring) {
     return false;
   }
 
+  public boolean contains(final UTF8String substring, int collationId) throws SparkException {
+    if (CollationFactory.fetchCollation(collationId).isBinaryCollation) {
+      return this.containsBinary(substring);
+    }
+    if (collationId == CollationFactory.LOWERCASE_COLLATION_ID) {
+      return this.toLowerCase().containsBinary(substring.toLowerCase());
+    }
+    // TODO: enable ICU collation support for "contains"
+    Map<String, String> params = new HashMap<>();
+    params.put("functionName", Thread.currentThread().getStackTrace()[1].getMethodName());
+    params.put("collationName", CollationFactory.fetchCollation(collationId).collationName);
+    throw new SparkException("COLLATION_NOT_SUPPORTED_FOR_FUNCTION",

Review Comment:
   yes, will implement "contains" for non-binary collations in a future PR





Re: [PR] [SPARK-47131][SQL][COLLATION] String function support: contains, startswith, endswith [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on code in PR #45216:
URL: https://github.com/apache/spark/pull/45216#discussion_r1508920123


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala:
##########
@@ -497,10 +497,33 @@ case class Lower(child: Expression)
 abstract class StringPredicate extends BinaryExpression
   with Predicate with ImplicitCastInputTypes with NullIntolerant {
 
+  final val collationId: Int = left.dataType.asInstanceOf[StringType].collationId
+
   def compare(l: UTF8String, r: UTF8String): Boolean
 
   override def inputTypes: Seq[DataType] = Seq(StringType, StringType)
 
+  override def checkInputDataTypes(): TypeCheckResult = {
+    val checkResult = super.checkInputDataTypes()
+    if (checkResult.isFailure) {
+      return checkResult
+    }
+    // Additional check needed for collation compatibility
+    val rightCollationId: Int = right.dataType.asInstanceOf[StringType].collationId
+    if (collationId != rightCollationId) {
+      throw new SparkException(

Review Comment:
   we should return `DataTypeMismatch` instead of failing directly.
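   
   Something along these lines (an illustrative sketch; the error sub-class and parameter names are placeholders):
   
   ```scala
   // Illustrative sketch: report the mismatch through the type-check framework
   // (TypeCheckResult.DataTypeMismatch) instead of throwing from checkInputDataTypes.
   override def checkInputDataTypes(): TypeCheckResult = {
     val defaultCheck = super.checkInputDataTypes()
     if (defaultCheck.isFailure) {
       defaultCheck
     } else if (collationId != right.dataType.asInstanceOf[StringType].collationId) {
       DataTypeMismatch(
         errorSubClass = "COLLATION_MISMATCH",
         messageParameters = Map(
           "collationNameLeft" ->
             CollationFactory.fetchCollation(collationId).collationName,
           "collationNameRight" -> CollationFactory.fetchCollation(
             right.dataType.asInstanceOf[StringType].collationId).collationName))
     } else {
       TypeCheckResult.TypeCheckSuccess
     }
   }
   ```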





Re: [PR] [SPARK-47131][SQL][COLLATION] String function support: contains, startswith, endswith [spark]

Posted by "uros-db (via GitHub)" <gi...@apache.org>.
uros-db commented on code in PR #45216:
URL: https://github.com/apache/spark/pull/45216#discussion_r1505460392


##########
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationFactory.java:
##########
@@ -105,6 +105,9 @@ public Collation(
   private static final Collation[] collationTable = new Collation[4];
   private static final HashMap<String, Integer> collationNameToIdMap = new HashMap<>();
 
+  public static final int DEFAULT_COLLATION_ID = 0;
+  public static final int LOWERCASE_COLLATION_ID = 1;

Review Comment:
   for now, `DEFAULT_COLLATION_ID` is mostly used outside of CollationFactory (e.g. StringType, ColumnType, PhysicalDataType), so there aren't too many places where a "0" is hardcoded instead
   
   while I suppose we could also do `collationTable[DEFAULT_COLLATION_ID] = ...`, it wouldn't really make too much sense, as only `0` and `1` are special cases (while `2`, `3`, and so on are not), and I also expect the collationTable initialization itself to evolve soon with respect to other collation types, so it hardly matters
   
   that said, I could change it for now?





Re: [PR] [SPARK-47131][SQL][COLLATION] String function support: contains, startswith, endswith [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on code in PR #45216:
URL: https://github.com/apache/spark/pull/45216#discussion_r1505462011


##########
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationFactory.java:
##########
@@ -105,6 +105,9 @@ public Collation(
   private static final Collation[] collationTable = new Collation[4];
   private static final HashMap<String, Integer> collationNameToIdMap = new HashMap<>();
 
+  public static final int DEFAULT_COLLATION_ID = 0;
+  public static final int LOWERCASE_COLLATION_ID = 1;

Review Comment:
   got it, we can leave it for now.





Re: [PR] [SPARK-47131][SQL][COLLATION] String function support: contains, startswith, endswith [spark]

Posted by "dbatomic (via GitHub)" <gi...@apache.org>.
dbatomic commented on code in PR #45216:
URL: https://github.com/apache/spark/pull/45216#discussion_r1508868584


##########
common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java:
##########
@@ -355,14 +371,43 @@ public boolean matchAt(final UTF8String s, int pos) {
     return ByteArrayMethods.arrayEquals(base, offset + pos, s.base, s.offset, s.numBytes);
   }
 
+  private boolean matchAt(final UTF8String s, int pos, int collationId) {
+    if (s.numBytes + pos > numBytes || pos < 0) {
+      return false;
+    }
+    return this.substring(pos, pos + s.numBytes).semanticCompare(s, collationId) == 0;
+  }
+
   public boolean startsWith(final UTF8String prefix) {
     return matchAt(prefix, 0);
   }
 
+  public boolean startsWith(final UTF8String prefix, int collationId) {
+    if (CollationFactory.fetchCollation(collationId).isBinaryCollation) {
+      return this.startsWith(prefix);
+    }
+    if (collationId == CollationFactory.LOWERCASE_COLLATION_ID) {
+      return this.toLowerCase().startsWith(prefix.toLowerCase());
+    }
+    // ICU collation support

Review Comment:
   I think that it is confusing. Either delete it or make it useful to the reader (e.g. "fallback to matchAt that will do collation-aware comparison through the ICU library").





Re: [PR] [SPARK-47131][SQL][COLLATION] String function support: contains, startswith, endswith [spark]

Posted by "uros-db (via GitHub)" <gi...@apache.org>.
uros-db commented on code in PR #45216:
URL: https://github.com/apache/spark/pull/45216#discussion_r1502735262


##########
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationFactory.java:
##########
@@ -66,6 +66,13 @@ public static class Collation {
      */
     public final boolean isBinaryCollation;
 
+    /**
+     * Built-in string functions with collation support.

Review Comment:
   Update: as agreed in another comment thread, routing is now partially moved over to UTF8String, while binary collations are resolved right away in Scala in order to prevent a performance regression





Re: [PR] [SPARK-47131][SQL][COLLATION] String function support: contains, startswith, endswith [spark]

Posted by "dbatomic (via GitHub)" <gi...@apache.org>.
dbatomic commented on code in PR #45216:
URL: https://github.com/apache/spark/pull/45216#discussion_r1502762736


##########
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationFactory.java:
##########
@@ -105,6 +105,9 @@ public Collation(
   private static final Collation[] collationTable = new Collation[4];
   private static final HashMap<String, Integer> collationNameToIdMap = new HashMap<>();
 
+  public static final int DEFAULT_COLLATION_ID = 0;

Review Comment:
   Let's avoid keeping COLLATION_IDs in both StringType and here. Maybe having an enum in CollationFactory is a good option?





Re: [PR] [SPARK-47131][SQL][COLLATION] String function support: contains, startswith, endswith [spark]

Posted by "dbatomic (via GitHub)" <gi...@apache.org>.
dbatomic commented on code in PR #45216:
URL: https://github.com/apache/spark/pull/45216#discussion_r1502783545


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala:
##########
@@ -622,10 +646,24 @@ object StartsWithExpressionBuilder extends StringBinaryPredicateExpressionBuilde
   }
 }
 
-case class StartsWith(left: Expression, right: Expression) extends StringPredicate {
-  override def compare(l: UTF8String, r: UTF8String): Boolean = l.startsWith(r)
+case class StartsWith(left: Expression, right: Expression) extends StringPredicate
+  with CollationCheck {
+  override def compare(l: UTF8String, r: UTF8String): Boolean = {
+    val collationId = checkCollationIds()
+    if (CollationFactory.fetchCollation(collationId).isBinaryCollation) {
+      l.startsWith(r)

Review Comment:
   Let's think about naming here. I don't like the idea that startsWith without a collation implies a binary check. Can we call these methods `startsWithBinary`/`containsBinary` etc.?





Re: [PR] [SPARK-47131][SQL][COLLATION] String function support: contains, startswith, endswith [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on code in PR #45216:
URL: https://github.com/apache/spark/pull/45216#discussion_r1508917781


##########
sql/api/src/main/scala/org/apache/spark/sql/types/StringType.scala:
##########
@@ -61,6 +61,5 @@ class StringType private(val collationId: Int) extends AtomicType with Serializa
  */
 @Stable
 case object StringType extends StringType(0) {
-  val DEFAULT_COLLATION_ID = 0
   def apply(collationId: Int): StringType = new StringType(collationId)

Review Comment:
   @dbatomic the public API should take a collation name instead of a collation id.





Re: [PR] [SPARK-47131][SQL][COLLATION] String function support: contains, startswith, endswith [spark]

Posted by "dbatomic (via GitHub)" <gi...@apache.org>.
dbatomic commented on code in PR #45216:
URL: https://github.com/apache/spark/pull/45216#discussion_r1508873927


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala:
##########
@@ -586,9 +608,21 @@ object ContainsExpressionBuilder extends StringBinaryPredicateExpressionBuilderB
 }
 
 case class Contains(left: Expression, right: Expression) extends StringPredicate {
-  override def compare(l: UTF8String, r: UTF8String): Boolean = l.contains(r)
+  override def compare(l: UTF8String, r: UTF8String): Boolean = {
+    val collationId: Int = left.dataType.asInstanceOf[StringType].collationId

Review Comment:
   Can you push collationId as a class property, outside of `compare`?
   
   That way you can share it between `doGenCode` and `compare`, and it should be more efficient to do the cast once rather than once per row.
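   
   Roughly something like this (an illustrative sketch, not the exact PR code):
   
   ```scala
   // Illustrative sketch: resolve the collation id once per expression instance so it is
   // shared by compare() and doGenCode(), instead of re-casting the data type per row.
   case class Contains(left: Expression, right: Expression) extends StringPredicate {
     final lazy val collationId: Int = left.dataType.asInstanceOf[StringType].collationId
   
     override def compare(l: UTF8String, r: UTF8String): Boolean = l.contains(r, collationId)
   
     override def doGenCode(ctx: CodegenContext, ev: ExprCode): ExprCode =
       defineCodeGen(ctx, ev, (c1, c2) => s"$c1.contains($c2, $collationId)")
   
     override protected def withNewChildrenInternal(
         newLeft: Expression, newRight: Expression): Contains =
       copy(left = newLeft, right = newRight)
   }
   ```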





Re: [PR] [SPARK-47131][SQL][COLLATION] String function support: contains, startswith, endswith [spark]

Posted by "uros-db (via GitHub)" <gi...@apache.org>.
uros-db commented on code in PR #45216:
URL: https://github.com/apache/spark/pull/45216#discussion_r1510723361


##########
sql/core/src/test/scala/org/apache/spark/sql/CollationSuite.scala:
##########
@@ -183,6 +183,266 @@ class CollationSuite extends DatasourceV2SQLBase {
     }
   }
 
+  test("checkCollation throws exception for incompatible collationIds") {
+    val left: String = "abc" // collate with 'UNICODE_CI'
+    val leftCollationName: String = "UNICODE_CI";
+    var right: String = null // collate with 'UNICODE'
+    val rightCollationName: String = "UNICODE";
+    // contains
+    right = left.substring(1, 2);
+    checkError(
+      exception = intercept[SparkException] {
+        spark.sql(s"SELECT contains(collate('$left', '$leftCollationName')," +
+          s"collate('$right', '$rightCollationName'))").collect()
+      },
+      errorClass = "COLLATION_MISMATCH",
+      sqlState = "42K09",
+      parameters = Map(
+        "collationNameLeft" -> s"$leftCollationName",
+        "collationNameRight" -> s"$rightCollationName"
+      )
+    )
+    // startsWith
+    right = left.substring(0, 1);
+    checkError(
+      exception = intercept[SparkException] {
+        spark.sql(s"SELECT startsWith(collate('$left', '$leftCollationName')," +
+          s"collate('$right', '$rightCollationName'))").collect()
+      },
+      errorClass = "COLLATION_MISMATCH",
+      sqlState = "42K09",
+      parameters = Map(
+        "collationNameLeft" -> s"$leftCollationName",
+        "collationNameRight" -> s"$rightCollationName"
+      )
+    )
+    // endsWith
+    right = left.substring(2, 3);
+    checkError(
+      exception = intercept[SparkException] {
+        spark.sql(s"SELECT endsWith(collate('$left', '$leftCollationName')," +
+          s"collate('$right', '$rightCollationName'))").collect()
+      },
+      errorClass = "COLLATION_MISMATCH",
+      sqlState = "42K09",
+      parameters = Map(
+        "collationNameLeft" -> s"$leftCollationName",
+        "collationNameRight" -> s"$rightCollationName"
+      )
+    )
+  }
+
+  test("Support contains string expression with Collation") {
+    // Test 'contains' with different collations
+    var listLeft: List[String] = List()
+    var listRight: List[String] = List()
+    var listResult: List[Boolean] = List()
+
+    // UCS_BASIC (default) & UNICODE collation
+    listLeft = List("", "c", "abc", "cde", "abde", "abcde", "C", "ABC", "CDE", "ABDE", "ABCDE")
+    listRight = List("", "c", "abc", "cde", "abde", "abcde", "C", "ABC", "CDE", "ABDE", "ABCDE")
+    listResult = List(
+    //  ""     c     abc    cde   abde  abcde    C     ABC    CDE    ABDE  ABCDE
+      true, false, false, false, false, false, false, false, false, false, false, //  ""
+      true, true, false, false, false, false, false, false, false, false, false,  //   c
+      true, true, true, false, false, false, false, false, false, false, false,   // abc
+      true, true, false, true, false, false, false, false, false, false, false,   //   cde
+      true, false, false, false, true, false, false, false, false, false, false,  // abde
+      true, true, true, true, false, true, false, false, false, false, false,     // abcde
+      true, false, false, false, false, false, true, false, false, false, false,  //   C
+      true, false, false, false, false, false, true, true, false, false, false,   // ABC
+      true, false, false, false, false, false, true, false, true, false, false,   //   CDE
+      true, false, false, false, false, false, false, false, false, true, false,  // ABDE
+      true, false, false, false, false, false, true, true, true, false, true)     // ABCDE

Review Comment:
   looks great, thanks everyone - fixed!





Re: [PR] [SPARK-47131][SQL][COLLATION] String function support: contains, startswith, endswith [spark]

Posted by "dbatomic (via GitHub)" <gi...@apache.org>.
dbatomic commented on code in PR #45216:
URL: https://github.com/apache/spark/pull/45216#discussion_r1502765099


##########
common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java:
##########
@@ -341,6 +340,15 @@ public boolean contains(final UTF8String substring) {
     return false;
   }
 
+  public boolean contains(final UTF8String substring, int collationId) {
+    if (collationId == CollationFactory.LOWERCASE_COLLATION_ID) {

Review Comment:
   What if we pass DEFAULT_COLLATION_ID here?





Re: [PR] [SPARK-47131][SQL][COLLATION] String function support: contains, startswith, endswith [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on code in PR #45216:
URL: https://github.com/apache/spark/pull/45216#discussion_r1505461280


##########
common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java:
##########
@@ -341,6 +342,21 @@ public boolean contains(final UTF8String substring) {
     return false;
   }
 
+  public boolean contains(final UTF8String substring, int collationId) throws SparkException {
+    if (CollationFactory.fetchCollation(collationId).isBinaryCollation) {
+      return this.containsBinary(substring);
+    }
+    if (collationId == CollationFactory.LOWERCASE_COLLATION_ID) {
+      return this.toLowerCase().containsBinary(substring.toLowerCase());
+    }
+    // TODO: enable ICU collation support for "contains"
+    Map<String, String> params = new HashMap<>();
+    params.put("functionName", Thread.currentThread().getStackTrace()[1].getMethodName());
+    params.put("collationName", CollationFactory.fetchCollation(collationId).collationName);
+    throw new SparkException("COLLATION_NOT_SUPPORTED_FOR_FUNCTION",

Review Comment:
   then we can throw an internal error for now, without an error class.
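   
   i.e. something like this (an illustrative sketch in Scala for brevity - the actual code path is in UTF8String.java, and the helper name is hypothetical):
   
   ```scala
   // Illustrative sketch: raise an internal error for collations that are not supported yet,
   // since the analyzer is expected to reject such queries before execution reaches this point.
   private def unsupportedCollation(functionName: String, collationId: Int): Nothing = {
     throw SparkException.internalError(
       s"Function '$functionName' does not yet support collation " +
         CollationFactory.fetchCollation(collationId).collationName)
   }
   ```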





Re: [PR] [SPARK-47131][SQL][COLLATION] String function support: contains, startswith, endswith [spark]

Posted by "uros-db (via GitHub)" <gi...@apache.org>.
uros-db commented on code in PR #45216:
URL: https://github.com/apache/spark/pull/45216#discussion_r1505471123


##########
common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java:
##########
@@ -341,6 +342,21 @@ public boolean contains(final UTF8String substring) {
     return false;
   }
 
+  public boolean contains(final UTF8String substring, int collationId) throws SparkException {
+    if (CollationFactory.fetchCollation(collationId).isBinaryCollation) {
+      return this.containsBinary(substring);
+    }
+    if (collationId == CollationFactory.LOWERCASE_COLLATION_ID) {
+      return this.toLowerCase().containsBinary(substring.toLowerCase());
+    }
+    // TODO: enable ICU collation support for "contains"
+    Map<String, String> params = new HashMap<>();
+    params.put("functionName", Thread.currentThread().getStackTrace()[1].getMethodName());
+    params.put("collationName", CollationFactory.fetchCollation(collationId).collationName);
+    throw new SparkException("COLLATION_NOT_SUPPORTED_FOR_FUNCTION",

Review Comment:
   I think the logic here was that there are going to be some functions that are not (yet, or ever) supported for particular collation types (some might get added in the near future, while others may not), so we will eventually need a way to let users know. That's why we added the `COLLATION_NOT_SUPPORTED_FOR_FUNCTION` error class - while it will likely get removed from here, there are other places where it will appear instead





Re: [PR] [SPARK-47131][SQL][COLLATION] String function support: contains, startswith, endswith [spark]

Posted by "dbatomic (via GitHub)" <gi...@apache.org>.
dbatomic commented on code in PR #45216:
URL: https://github.com/apache/spark/pull/45216#discussion_r1506037403


##########
common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java:
##########
@@ -331,7 +333,6 @@ public boolean contains(final UTF8String substring) {
     if (substring.numBytes == 0) {
       return true;
     }
-

Review Comment:
   Let's avoid unneeded changes.





Re: [PR] [SPARK-47131][SQL][COLLATION] String function support: contains, startswith, endswith [spark]

Posted by "uros-db (via GitHub)" <gi...@apache.org>.
uros-db commented on code in PR #45216:
URL: https://github.com/apache/spark/pull/45216#discussion_r1502795006


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala:
##########
@@ -622,10 +646,24 @@ object StartsWithExpressionBuilder extends StringBinaryPredicateExpressionBuilde
   }
 }
 
-case class StartsWith(left: Expression, right: Expression) extends StringPredicate {
-  override def compare(l: UTF8String, r: UTF8String): Boolean = l.startsWith(r)
+case class StartsWith(left: Expression, right: Expression) extends StringPredicate
+  with CollationCheck {
+  override def compare(l: UTF8String, r: UTF8String): Boolean = {
+    val collationId = checkCollationIds()
+    if (CollationFactory.fetchCollation(collationId).isBinaryCollation) {
+      l.startsWith(r)

Review Comment:
   You're right. "startsWith" was the original name for the binary method (i.e. the only one supported at the time). Now, it would probably be best to change it to "startsWithBinary" (and probably "startsWithNonBinary" for the other one)





Re: [PR] [SPARK-47131][SQL][COLLATION] String function support: contains, startswith, endswith [spark]

Posted by "uros-db (via GitHub)" <gi...@apache.org>.
uros-db commented on code in PR #45216:
URL: https://github.com/apache/spark/pull/45216#discussion_r1503800082


##########
common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java:
##########
@@ -341,6 +340,15 @@ public boolean contains(final UTF8String substring) {
     return false;
   }
 
+  public boolean contains(final UTF8String substring, int collationId) {
+    if (collationId == CollationFactory.LOWERCASE_COLLATION_ID) {

Review Comment:
   Ahh, I see. Makes sense, I do agree that `contains(final UTF8String substring, int collationId)` should be able to handle all collations, even if it's not currently invoked for DEFAULT_COLLATION_ID. I added this in the latest commit!
   
   However, that same default option that works for DEFAULT_COLLATION_ID (UCS_BASIC) also works for other binary collations supported by CollationFactory (currently: UNICODE).
   
   I suppose we should go ahead and cover them all here using:
   `if (CollationFactory.fetchCollation(collationId).isBinaryCollation)`
   instead of just
   ` if (collationId == CollationFactory.DEFAULT_COLLATION_ID)`





Re: [PR] [WIP] Collation support for built-in string functions: contains, startswith, endswith [spark]

Posted by "dbatomic (via GitHub)" <gi...@apache.org>.
dbatomic commented on code in PR #45216:
URL: https://github.com/apache/spark/pull/45216#discussion_r1498988909


##########
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationFactory.java:
##########
@@ -66,6 +66,13 @@ public static class Collation {
      */
     public final boolean isBinaryCollation;
 
+    /**
+     * Built-in string functions with collation support.

Review Comment:
   Instead of adding new functions here I would propose the following:
   1) Add only low-level basic functions here (e.g. matchAt).
   2) Extend UTF8String with `public boolean startsWith(final UTF8String prefix, *collationId*)`. Have all the routing happen there.
   3) Short-circuit to the previous implementation (e.g. `startsWith(final UTF8String prefix)`) if we have `DEFAULT_COLLATION_ID`. That way we will not risk regressing the previous version.
   4) Have operators call startsWith/endsWith... with the proper collationId.





Re: [PR] [SPARK-47131][SQL] Collations - support for built-in string functions: contains, startswith, endswith [spark]

Posted by "mitkedb (via GitHub)" <gi...@apache.org>.
mitkedb commented on code in PR #45216:
URL: https://github.com/apache/spark/pull/45216#discussion_r1499331276


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala:
##########
@@ -586,9 +584,17 @@ object ContainsExpressionBuilder extends StringBinaryPredicateExpressionBuilderB
 }
 
 case class Contains(left: Expression, right: Expression) extends StringPredicate {
-  override def compare(l: UTF8String, r: UTF8String): Boolean = l.contains(r)
+  override def compare(l: UTF8String, r: UTF8String): Boolean = {
+    val collationID = right.dataType.asInstanceOf[StringType].collationId

Review Comment:
   At a minimum, the generated code for UTF8 binary should be identical before and after our collation changes





Re: [PR] [SPARK-47131][SQL][COLLATION] String function support: contains, startswith, endswith [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on code in PR #45216:
URL: https://github.com/apache/spark/pull/45216#discussion_r1510755527


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala:
##########
@@ -497,10 +497,32 @@ case class Lower(child: Expression)
 abstract class StringPredicate extends BinaryExpression
   with Predicate with ImplicitCastInputTypes with NullIntolerant {
 
+  final val collationId: Int = left.dataType.asInstanceOf[StringType].collationId

Review Comment:
   I think it should be a lazy val, in case we fail with a Java cast issue before `checkInputDataTypes` is called.





Re: [PR] [SPARK-47131][SQL][COLLATION] String function support: contains, startswith, endswith [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on code in PR #45216:
URL: https://github.com/apache/spark/pull/45216#discussion_r1510829030


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala:
##########
@@ -586,9 +608,38 @@ object ContainsExpressionBuilder extends StringBinaryPredicateExpressionBuilderB
 }
 
 case class Contains(left: Expression, right: Expression) extends StringPredicate {
-  override def compare(l: UTF8String, r: UTF8String): Boolean = l.contains(r)
+  override def checkInputDataTypes(): TypeCheckResult = {
+    val checkResult = super.checkInputDataTypes()
+    if (checkResult.isFailure) {
+      return checkResult
+    }
+    // Additional check needed for collation support
+    if (!CollationFactory.fetchCollation(collationId).isBinaryCollation
+      && collationId != CollationFactory.LOWERCASE_COLLATION_ID) {

Review Comment:
   @dbatomic we should unify the naming in a follow-up: binary stable, binary comparable, default collation, etc.





Re: [PR] [SPARK-47131][SQL][COLLATION] String function support: contains, startswith, endswith [spark]

Posted by "uros-db (via GitHub)" <gi...@apache.org>.
uros-db commented on code in PR #45216:
URL: https://github.com/apache/spark/pull/45216#discussion_r1509076371


##########
sql/core/src/test/scala/org/apache/spark/sql/CollationSuite.scala:
##########
@@ -183,6 +183,266 @@ class CollationSuite extends DatasourceV2SQLBase {
     }
   }
 
+  test("checkCollation throws exception for incompatible collationIds") {
+    val left: String = "abc" // collate with 'UNICODE_CI'
+    val leftCollationName: String = "UNICODE_CI";
+    var right: String = null // collate with 'UNICODE'
+    val rightCollationName: String = "UNICODE";
+    // contains
+    right = left.substring(1, 2);
+    checkError(
+      exception = intercept[SparkException] {
+        spark.sql(s"SELECT contains(collate('$left', '$leftCollationName')," +
+          s"collate('$right', '$rightCollationName'))").collect()
+      },
+      errorClass = "COLLATION_MISMATCH",
+      sqlState = "42K09",
+      parameters = Map(
+        "collationNameLeft" -> s"$leftCollationName",
+        "collationNameRight" -> s"$rightCollationName"
+      )
+    )
+    // startsWith
+    right = left.substring(0, 1);
+    checkError(
+      exception = intercept[SparkException] {
+        spark.sql(s"SELECT startsWith(collate('$left', '$leftCollationName')," +
+          s"collate('$right', '$rightCollationName'))").collect()
+      },
+      errorClass = "COLLATION_MISMATCH",
+      sqlState = "42K09",
+      parameters = Map(
+        "collationNameLeft" -> s"$leftCollationName",
+        "collationNameRight" -> s"$rightCollationName"
+      )
+    )
+    // endsWith
+    right = left.substring(2, 3);
+    checkError(
+      exception = intercept[SparkException] {
+        spark.sql(s"SELECT endsWith(collate('$left', '$leftCollationName')," +
+          s"collate('$right', '$rightCollationName'))").collect()
+      },
+      errorClass = "COLLATION_MISMATCH",
+      sqlState = "42K09",
+      parameters = Map(
+        "collationNameLeft" -> s"$leftCollationName",
+        "collationNameRight" -> s"$rightCollationName"
+      )
+    )
+  }
+
+  test("Support contains string expression with Collation") {
+    // Test 'contains' with different collations
+    var listLeft: List[String] = List()
+    var listRight: List[String] = List()
+    var listResult: List[Boolean] = List()
+
+    // UCS_BASIC (default) & UNICODE collation
+    listLeft = List("", "c", "abc", "cde", "abde", "abcde", "C", "ABC", "CDE", "ABDE", "ABCDE")
+    listRight = List("", "c", "abc", "cde", "abde", "abcde", "C", "ABC", "CDE", "ABDE", "ABCDE")
+    listResult = List(
+    //  ""     c     abc    cde   abde  abcde    C     ABC    CDE    ABDE  ABCDE
+      true, false, false, false, false, false, false, false, false, false, false, //  ""
+      true, true, false, false, false, false, false, false, false, false, false,  //   c
+      true, true, true, false, false, false, false, false, false, false, false,   // abc
+      true, true, false, true, false, false, false, false, false, false, false,   //   cde
+      true, false, false, false, true, false, false, false, false, false, false,  // abde
+      true, true, true, true, false, true, false, false, false, false, false,     // abcde
+      true, false, false, false, false, false, true, false, false, false, false,  //   C
+      true, false, false, false, false, false, true, true, false, false, false,   // ABC
+      true, false, false, false, false, false, true, false, true, false, false,   //   CDE
+      true, false, false, false, false, false, false, false, false, true, false,  // ABDE
+      true, false, false, false, false, false, true, true, true, false, true)     // ABCDE

Review Comment:
   while it may seem a bit unusual, I think this matrix approach covers a broad spectrum of test cases and generally works really well for this set of functions - covering various edge-cases and different collation types (this was especially useful when debugging and experimenting with new collations)
   
   when I first wrote it as a standard linear set of tests, it was much harder to see how and why these functions behave the way they do with different collations; this nicely aligned matrix gives a pretty clear explanation all in one place
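   
   Concretely, the expected value for each (left, right) pair is read back with a row-major lookup into the flattened list, along these lines (same `listLeft`/`listRight`/`listResult` names as in the test above):
   
   ```scala
   // Row-major lookup: the row is the index of the left string,
   // the column is the index of the right string.
   for {
     (left, i) <- listLeft.zipWithIndex
     (right, j) <- listRight.zipWithIndex
   } {
     val expected = listResult(i * listRight.length + j)
     checkAnswer(sql(s"SELECT contains('$left', '$right')"), Row(expected))
   }
   ```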





Re: [PR] [SPARK-47131][SQL][COLLATION] String function support: contains, startswith, endswith [spark]

Posted by "dbatomic (via GitHub)" <gi...@apache.org>.
dbatomic commented on code in PR #45216:
URL: https://github.com/apache/spark/pull/45216#discussion_r1502789593


##########
sql/core/src/test/scala/org/apache/spark/sql/CollationSuite.scala:
##########
@@ -174,4 +174,237 @@ class CollationSuite extends QueryTest with SharedSparkSession {
           Row(expected))
     }
   }
+
+  test("checkCollation throws exception for incompatible collationIds") {
+    val left = "abc" // collate with 'UNICODE_CI'
+    val right = "a" // collate with 'UNICODE'
+    var exception: IllegalArgumentException = null
+    // contains
+    exception = intercept[IllegalArgumentException] {

Review Comment:
   you should use `checkError` for this. E.g. see `  test("collate function syntax invalid arg count")`.





Re: [PR] [SPARK-47131][SQL][COLLATION] String function support: contains, startswith, endswith [spark]

Posted by "uros-db (via GitHub)" <gi...@apache.org>.
uros-db commented on code in PR #45216:
URL: https://github.com/apache/spark/pull/45216#discussion_r1508672674


##########
common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java:
##########
@@ -331,7 +333,6 @@ public boolean contains(final UTF8String substring) {
     if (substring.numBytes == 0) {
       return true;
     }
-

Review Comment:
   got it, thanks!





Re: [PR] [SPARK-47131][SQL][COLLATION] String function support: contains, startswith, endswith [spark]

Posted by "dbatomic (via GitHub)" <gi...@apache.org>.
dbatomic commented on code in PR #45216:
URL: https://github.com/apache/spark/pull/45216#discussion_r1509083011


##########
sql/core/src/test/scala/org/apache/spark/sql/CollationSuite.scala:
##########
@@ -183,6 +183,266 @@ class CollationSuite extends DatasourceV2SQLBase {
     }
   }
 
+  test("checkCollation throws exception for incompatible collationIds") {
+    val left: String = "abc" // collate with 'UNICODE_CI'
+    val leftCollationName: String = "UNICODE_CI";
+    var right: String = null // collate with 'UNICODE'
+    val rightCollationName: String = "UNICODE";
+    // contains
+    right = left.substring(1, 2);
+    checkError(
+      exception = intercept[SparkException] {
+        spark.sql(s"SELECT contains(collate('$left', '$leftCollationName')," +
+          s"collate('$right', '$rightCollationName'))").collect()
+      },
+      errorClass = "COLLATION_MISMATCH",
+      sqlState = "42K09",
+      parameters = Map(
+        "collationNameLeft" -> s"$leftCollationName",
+        "collationNameRight" -> s"$rightCollationName"
+      )
+    )
+    // startsWith
+    right = left.substring(0, 1);
+    checkError(
+      exception = intercept[SparkException] {
+        spark.sql(s"SELECT startsWith(collate('$left', '$leftCollationName')," +
+          s"collate('$right', '$rightCollationName'))").collect()
+      },
+      errorClass = "COLLATION_MISMATCH",
+      sqlState = "42K09",
+      parameters = Map(
+        "collationNameLeft" -> s"$leftCollationName",
+        "collationNameRight" -> s"$rightCollationName"
+      )
+    )
+    // endsWith
+    right = left.substring(2, 3);
+    checkError(
+      exception = intercept[SparkException] {
+        spark.sql(s"SELECT endsWith(collate('$left', '$leftCollationName')," +
+          s"collate('$right', '$rightCollationName'))").collect()
+      },
+      errorClass = "COLLATION_MISMATCH",
+      sqlState = "42K09",
+      parameters = Map(
+        "collationNameLeft" -> s"$leftCollationName",
+        "collationNameRight" -> s"$rightCollationName"
+      )
+    )
+  }
+
+  test("Support contains string expression with Collation") {
+    // Test 'contains' with different collations
+    var listLeft: List[String] = List()
+    var listRight: List[String] = List()
+    var listResult: List[Boolean] = List()
+
+    // UCS_BASIC (default) & UNICODE collation
+    listLeft = List("", "c", "abc", "cde", "abde", "abcde", "C", "ABC", "CDE", "ABDE", "ABCDE")
+    listRight = List("", "c", "abc", "cde", "abde", "abcde", "C", "ABC", "CDE", "ABDE", "ABCDE")
+    listResult = List(
+    //  ""     c     abc    cde   abde  abcde    C     ABC    CDE    ABDE  ABCDE
+      true, false, false, false, false, false, false, false, false, false, false, //  ""
+      true, true, false, false, false, false, false, false, false, false, false,  //   c
+      true, true, true, false, false, false, false, false, false, false, false,   // abc
+      true, true, false, true, false, false, false, false, false, false, false,   //   cde
+      true, false, false, false, true, false, false, false, false, false, false,  // abde
+      true, true, true, true, false, true, false, false, false, false, false,     // abcde
+      true, false, false, false, false, false, true, false, false, false, false,  //   C
+      true, false, false, false, false, false, true, true, false, false, false,   // ABC
+      true, false, false, false, false, false, true, false, true, false, false,   //   CDE
+      true, false, false, false, false, false, false, false, false, true, false,  // ABDE
+      true, false, false, false, false, false, true, true, true, false, true)     // ABCDE

Review Comment:
   Can we do case class checks, similarly to what we were doing in `CollationFactorySuite.scala`?
   
   ```scala
     case class CollationTestCase[R](collationName: String, s1: String, s2: String, expectedResult: R)
   
     test("collation aware equality and hash") {
       val checks = Seq(
         CollationTestCase("UCS_BASIC", "aaa", "aaa", true),
         CollationTestCase("UCS_BASIC", "aaa", "AAA", false),
         CollationTestCase("UCS_BASIC", "aaa", "bbb", false),
         CollationTestCase("UCS_BASIC_LCASE", "aaa", "aaa", true),
         CollationTestCase("UCS_BASIC_LCASE", "aaa", "AAA", true),
         CollationTestCase("UCS_BASIC_LCASE", "aaa", "AaA", true),
         CollationTestCase("UCS_BASIC_LCASE", "aaa", "AaA", true),
         CollationTestCase("UCS_BASIC_LCASE", "aaa", "aa", false),
         CollationTestCase("UCS_BASIC_LCASE", "aaa", "bbb", false),
         CollationTestCase("UNICODE", "aaa", "aaa", true),
         CollationTestCase("UNICODE", "aaa", "AAA", false),
         CollationTestCase("UNICODE", "aaa", "bbb", false),
         CollationTestCase("UNICODE_CI", "aaa", "aaa", true),
         CollationTestCase("UNICODE_CI", "aaa", "AAA", true),
         CollationTestCase("UNICODE_CI", "aaa", "bbb", false))
   ```
   
   Also, having so many test cases may indeed be overkill :) I think that:
   1) empty string
   2/3) single char upper + lower
   4/5) multichar upper + lower
   6) Seq that is not contained anywhere.
   
   Would suffice.
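   
   For reference, a minimal sketch of such a reduced, case-class based set for `contains` (the strings, expected values, and the UCS_BASIC_LCASE choice below are illustrative, assuming this runs inside CollationSuite where `sql`, `checkAnswer` and `Row` are in scope):
   
   ```scala
   case class CollationTestCase[R](collationName: String, s1: String, s2: String, expectedResult: R)
   
   val containsChecks = Seq(
     CollationTestCase("UCS_BASIC_LCASE", "abcde", "", true),      // 1) empty string
     CollationTestCase("UCS_BASIC_LCASE", "abcde", "c", true),     // 2) single char, lower
     CollationTestCase("UCS_BASIC_LCASE", "abcde", "C", true),     // 3) single char, upper
     CollationTestCase("UCS_BASIC_LCASE", "abcde", "bcd", true),   // 4) multichar, lower
     CollationTestCase("UCS_BASIC_LCASE", "abcde", "BCD", true),   // 5) multichar, upper
     CollationTestCase("UCS_BASIC_LCASE", "abcde", "fgh", false))  // 6) not contained anywhere
   
   containsChecks.foreach { t =>
     checkAnswer(
       sql(s"SELECT contains(collate('${t.s1}', '${t.collationName}'), " +
         s"collate('${t.s2}', '${t.collationName}'))"),
       Row(t.expectedResult))
   }
   ```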
   





Re: [PR] [SPARK-47131][SQL][COLLATION] String function support: contains, startswith, endswith [spark]

Posted by "uros-db (via GitHub)" <gi...@apache.org>.
uros-db commented on code in PR #45216:
URL: https://github.com/apache/spark/pull/45216#discussion_r1509069300


##########
sql/core/src/test/scala/org/apache/spark/sql/CollationSuite.scala:
##########
@@ -183,6 +183,266 @@ class CollationSuite extends DatasourceV2SQLBase {
     }
   }
 
+  test("checkCollation throws exception for incompatible collationIds") {
+    val left: String = "abc" // collate with 'UNICODE_CI'
+    val leftCollationName: String = "UNICODE_CI";
+    var right: String = null // collate with 'UNICODE'
+    val rightCollationName: String = "UNICODE";
+    // contains
+    right = left.substring(1, 2);
+    checkError(
+      exception = intercept[SparkException] {
+        spark.sql(s"SELECT contains(collate('$left', '$leftCollationName')," +
+          s"collate('$right', '$rightCollationName'))").collect()

Review Comment:
   I think I need .collect() in this case, otherwise it doesn't produce an exception here
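   
   For illustration (a sketch; the exact point of failure depends on where the check ends up living): building the DataFrame only parses and analyzes the query, and with the mismatch detected during evaluation it is the action that surfaces the error.
   
   ```scala
   // No error yet: spark.sql only constructs and analyzes the plan.
   val df = spark.sql(
     "SELECT contains(collate('abc', 'UNICODE_CI'), collate('b', 'UNICODE'))")
   
   // The COLLATION_MISMATCH error only appears once execution is forced.
   df.collect()
   ```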





Re: [PR] [SPARK-47131][SQL][COLLATION] String function support: contains, startswith, endswith [spark]

Posted by "dbatomic (via GitHub)" <gi...@apache.org>.
dbatomic commented on code in PR #45216:
URL: https://github.com/apache/spark/pull/45216#discussion_r1509075313


##########
sql/api/src/main/scala/org/apache/spark/sql/types/StringType.scala:
##########
@@ -61,6 +61,5 @@ class StringType private(val collationId: Int) extends AtomicType with Serializa
  */
 @Stable
 case object StringType extends StringType(0) {
-  val DEFAULT_COLLATION_ID = 0
   def apply(collationId: Int): StringType = new StringType(collationId)

Review Comment:
   Ack. We will follow up on this.





Re: [PR] [SPARK-47131][SQL][COLLATION] String function support: contains, startswith, endswith [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on code in PR #45216:
URL: https://github.com/apache/spark/pull/45216#discussion_r1505448239


##########
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationFactory.java:
##########
@@ -105,6 +105,9 @@ public Collation(
   private static final Collation[] collationTable = new Collation[4];
   private static final HashMap<String, Integer> collationNameToIdMap = new HashMap<>();
 
+  public static final int DEFAULT_COLLATION_ID = 0;
+  public static final int LOWERCASE_COLLATION_ID = 1;

Review Comment:
   shall we use these constants in places like `collationTable[0]`?





Re: [PR] [SPARK-47131][SQL][COLLATION] String function support: contains, startswith, endswith [spark]

Posted by "uros-db (via GitHub)" <gi...@apache.org>.
uros-db commented on code in PR #45216:
URL: https://github.com/apache/spark/pull/45216#discussion_r1508704687


##########
common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java:
##########
@@ -341,6 +342,21 @@ public boolean contains(final UTF8String substring) {
     return false;
   }
 
+  public boolean contains(final UTF8String substring, int collationId) throws SparkException {
+    if (CollationFactory.fetchCollation(collationId).isBinaryCollation) {
+      return this.contains(substring);
+    }
+    if (collationId == CollationFactory.LOWERCASE_COLLATION_ID) {
+      return this.toLowerCase().contains(substring.toLowerCase());
+    }
+    // TODO: enable ICU collation support for "contains"
+    Map<String, String> params = new HashMap<>();
+    params.put("functionName", Thread.currentThread().getStackTrace()[1].getMethodName());

Review Comment:
   agreed!





Re: [PR] [SPARK-47131][SQL] Collations - support for built-in string functions: contains, startswith, endswith [spark]

Posted by "mitkedb (via GitHub)" <gi...@apache.org>.
mitkedb commented on code in PR #45216:
URL: https://github.com/apache/spark/pull/45216#discussion_r1499330134


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala:
##########
@@ -586,9 +584,17 @@ object ContainsExpressionBuilder extends StringBinaryPredicateExpressionBuilderB
 }
 
 case class Contains(left: Expression, right: Expression) extends StringPredicate {
-  override def compare(l: UTF8String, r: UTF8String): Boolean = l.contains(r)
+  override def compare(l: UTF8String, r: UTF8String): Boolean = {
+    val collationID = right.dataType.asInstanceOf[StringType].collationId

Review Comment:
   @dbatomic @uros-db 
   
   Should we consider special casing the path for the default UTF8 binary collation, to prevent any potential performance regression in the default code path?
   
   Something like the following:
   
   if right.dataType.isDefaultCollationFinalVariable (it is important that the variable is final as it will result in significantly better optimizations by JIT IIRC)
      default / current implementation
   else
      collation implementation
      
      
   And the codegen version of the code should remain identical to how it was before we added collations... I am guessing we can check the collation ID at compile time, and just generate exactly the same codegen code for UTF8 binary
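   
   A rough sketch of that shape as it might look in stringExpressions.scala (hedged: `DEFAULT_COLLATION_ID` and the two `contains` overloads follow this PR, but the exact wiring inside `Contains` is illustrative):
   
   ```scala
   case class Contains(left: Expression, right: Expression) extends StringPredicate {
     // Resolved once per expression instance, not per row.
     private lazy val collationId: Int = left.dataType.asInstanceOf[StringType].collationId
     private lazy val isDefaultCollation: Boolean =
       collationId == CollationFactory.DEFAULT_COLLATION_ID
   
     override def compare(l: UTF8String, r: UTF8String): Boolean = {
       if (isDefaultCollation) l.contains(r)   // unchanged binary fast path
       else l.contains(r, collationId)         // collation-aware path
     }
   
     override def doGenCode(ctx: CodegenContext, ev: ExprCode): ExprCode = {
       if (isDefaultCollation) {
         // Identical generated code to the pre-collation implementation.
         defineCodeGen(ctx, ev, (c1, c2) => s"($c1).contains($c2)")
       } else {
         defineCodeGen(ctx, ev, (c1, c2) => s"($c1).contains($c2, $collationId)")
       }
     }
   }
   ```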
   
   
   
   
   





Re: [PR] [SPARK-47131][SQL][COLLATION] String function support: contains, startswith, endswith [spark]

Posted by "dbatomic (via GitHub)" <gi...@apache.org>.
dbatomic commented on code in PR #45216:
URL: https://github.com/apache/spark/pull/45216#discussion_r1502762736


##########
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationFactory.java:
##########
@@ -105,6 +105,9 @@ public Collation(
   private static final Collation[] collationTable = new Collation[4];
   private static final HashMap<String, Integer> collationNameToIdMap = new HashMap<>();
 
+  public static final int DEFAULT_COLLATION_ID = 0;

Review Comment:
   Let's avoid keeping COLLATION_IDs in both StringType and here. Maybe having an enum in CollationFactory is a good option?





Re: [PR] [SPARK-47131][SQL][COLLATION] String function support: contains, startswith, endswith [spark]

Posted by "uros-db (via GitHub)" <gi...@apache.org>.
uros-db commented on code in PR #45216:
URL: https://github.com/apache/spark/pull/45216#discussion_r1502791434


##########
common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java:
##########
@@ -341,6 +340,15 @@ public boolean contains(final UTF8String substring) {
     return false;
   }
 
+  public boolean contains(final UTF8String substring, int collationId) {
+    if (collationId == CollationFactory.LOWERCASE_COLLATION_ID) {

Review Comment:
   That was the case previously. However, as of the last commit, that part is handled in stringExpressions.scala so as not to regress performance for the original (current) implementation of contains/startsWith/endsWith with the default collation. As discussed above, it's important to handle that case early there, while all non-binary collations pass through to UTF8String.java





Re: [PR] [SPARK-47131][SQL][COLLATION] String function support: contains, startswith, endswith [spark]

Posted by "uros-db (via GitHub)" <gi...@apache.org>.
uros-db commented on code in PR #45216:
URL: https://github.com/apache/spark/pull/45216#discussion_r1504086524


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala:
##########
@@ -622,10 +646,24 @@ object StartsWithExpressionBuilder extends StringBinaryPredicateExpressionBuilde
   }
 }
 
-case class StartsWith(left: Expression, right: Expression) extends StringPredicate {
-  override def compare(l: UTF8String, r: UTF8String): Boolean = l.startsWith(r)
+case class StartsWith(left: Expression, right: Expression) extends StringPredicate
+  with CollationCheck {
+  override def compare(l: UTF8String, r: UTF8String): Boolean = {
+    val collationId = checkCollationIds()
+    if (CollationFactory.fetchCollation(collationId).isBinaryCollation) {
+      l.startsWith(r)

Review Comment:
   Going with:
   - `containsBinary(substring)`: old / current implementation (binary collations only)
   - `contains(substring, collationId)`: new implementation (can be used for all collations)





Re: [PR] [SPARK-47131][SQL][COLLATION] String function support: contains, startswith, endswith [spark]

Posted by "mkaravel (via GitHub)" <gi...@apache.org>.
mkaravel commented on code in PR #45216:
URL: https://github.com/apache/spark/pull/45216#discussion_r1505438346


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala:
##########
@@ -559,6 +559,25 @@ case class BinaryPredicate(override val prettyName: String, left: Expression, ri
   }
 }
 
+trait CollationCheck {

Review Comment:
   Maybe rename to `CollationCompatibilityCheck` to make it clearer what it does?





Re: [PR] [SPARK-47131][SQL][COLLATION] String function support: contains, startswith, endswith [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on code in PR #45216:
URL: https://github.com/apache/spark/pull/45216#discussion_r1505449019


##########
common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java:
##########
@@ -341,6 +342,21 @@ public boolean contains(final UTF8String substring) {
     return false;
   }
 
+  public boolean contains(final UTF8String substring, int collationId) throws SparkException {
+    if (CollationFactory.fetchCollation(collationId).isBinaryCollation) {
+      return this.containsBinary(substring);
+    }
+    if (collationId == CollationFactory.LOWERCASE_COLLATION_ID) {
+      return this.toLowerCase().containsBinary(substring.toLowerCase());
+    }
+    // TODO: enable ICU collation support for "contains"
+    Map<String, String> params = new HashMap<>();
+    params.put("functionName", Thread.currentThread().getStackTrace()[1].getMethodName());
+    params.put("collationName", CollationFactory.fetchCollation(collationId).collationName);
+    throw new SparkException("COLLATION_NOT_SUPPORTED_FOR_FUNCTION",

Review Comment:
   is this temporary?





Re: [PR] [SPARK-47131][SQL][COLLATION] String function support: contains, startswith, endswith [spark]

Posted by "uros-db (via GitHub)" <gi...@apache.org>.
uros-db commented on code in PR #45216:
URL: https://github.com/apache/spark/pull/45216#discussion_r1508803718


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala:
##########
@@ -585,10 +604,23 @@ object ContainsExpressionBuilder extends StringBinaryPredicateExpressionBuilderB
   }
 }
 
-case class Contains(left: Expression, right: Expression) extends StringPredicate {
-  override def compare(l: UTF8String, r: UTF8String): Boolean = l.contains(r)
+case class Contains(left: Expression, right: Expression) extends StringPredicate
+  with CollationCompatibilityCheck {
+  override def compare(l: UTF8String, r: UTF8String): Boolean = {
+    val collationId = checkCollationIds()

Review Comment:
   fixed!





Re: [PR] [SPARK-47131][SQL][COLLATION] String function support: contains, startswith, endswith [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on code in PR #45216:
URL: https://github.com/apache/spark/pull/45216#discussion_r1508925352


##########
sql/core/src/test/scala/org/apache/spark/sql/CollationSuite.scala:
##########
@@ -183,6 +183,266 @@ class CollationSuite extends DatasourceV2SQLBase {
     }
   }
 
+  test("checkCollation throws exception for incompatible collationIds") {
+    val left: String = "abc" // collate with 'UNICODE_CI'
+    val leftCollationName: String = "UNICODE_CI";
+    var right: String = null // collate with 'UNICODE'
+    val rightCollationName: String = "UNICODE";
+    // contains
+    right = left.substring(1, 2);
+    checkError(
+      exception = intercept[SparkException] {
+        spark.sql(s"SELECT contains(collate('$left', '$leftCollationName')," +
+          s"collate('$right', '$rightCollationName'))").collect()
+      },
+      errorClass = "COLLATION_MISMATCH",
+      sqlState = "42K09",
+      parameters = Map(
+        "collationNameLeft" -> s"$leftCollationName",
+        "collationNameRight" -> s"$rightCollationName"
+      )
+    )
+    // startsWith
+    right = left.substring(0, 1);
+    checkError(
+      exception = intercept[SparkException] {
+        spark.sql(s"SELECT startsWith(collate('$left', '$leftCollationName')," +
+          s"collate('$right', '$rightCollationName'))").collect()
+      },
+      errorClass = "COLLATION_MISMATCH",
+      sqlState = "42K09",
+      parameters = Map(
+        "collationNameLeft" -> s"$leftCollationName",
+        "collationNameRight" -> s"$rightCollationName"
+      )
+    )
+    // endsWith
+    right = left.substring(2, 3);
+    checkError(
+      exception = intercept[SparkException] {
+        spark.sql(s"SELECT endsWith(collate('$left', '$leftCollationName')," +
+          s"collate('$right', '$rightCollationName'))").collect()
+      },
+      errorClass = "COLLATION_MISMATCH",
+      sqlState = "42K09",
+      parameters = Map(
+        "collationNameLeft" -> s"$leftCollationName",
+        "collationNameRight" -> s"$rightCollationName"
+      )
+    )
+  }
+
+  test("Support contains string expression with Collation") {
+    // Test 'contains' with different collations
+    var listLeft: List[String] = List()
+    var listRight: List[String] = List()
+    var listResult: List[Boolean] = List()
+
+    // UCS_BASIC (default) & UNICODE collation
+    listLeft = List("", "c", "abc", "cde", "abde", "abcde", "C", "ABC", "CDE", "ABDE", "ABCDE")
+    listRight = List("", "c", "abc", "cde", "abde", "abcde", "C", "ABC", "CDE", "ABDE", "ABCDE")
+    listResult = List(
+    //  ""     c     abc    cde   abde  abcde    C     ABC    CDE    ABDE  ABCDE
+      true, false, false, false, false, false, false, false, false, false, false, //  ""
+      true, true, false, false, false, false, false, false, false, false, false,  //   c
+      true, true, true, false, false, false, false, false, false, false, false,   // abc
+      true, true, false, true, false, false, false, false, false, false, false,   //   cde
+      true, false, false, false, true, false, false, false, false, false, false,  // abde
+      true, true, true, true, false, true, false, false, false, false, false,     // abcde
+      true, false, false, false, false, false, true, false, false, false, false,  //   C
+      true, false, false, false, false, false, true, true, false, false, false,   // ABC
+      true, false, false, false, false, false, true, false, true, false, false,   //   CDE
+      true, false, false, false, false, false, false, false, false, true, false,  // ABDE
+      true, false, false, false, false, false, true, true, true, false, true)     // ABCDE

Review Comment:
   This is hard to review...  Can we just provide a few typical cases to test?





Re: [PR] [SPARK-47131][SQL][COLLATION] String function support: contains, startswith, endswith [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on PR #45216:
URL: https://github.com/apache/spark/pull/45216#issuecomment-1975976208

   @uros-db  unfortunately it has code conflicts now




Re: [PR] [SPARK-47131][SQL][COLLATION] String function support: contains, startswith, endswith [spark]

Posted by "uros-db (via GitHub)" <gi...@apache.org>.
uros-db commented on code in PR #45216:
URL: https://github.com/apache/spark/pull/45216#discussion_r1508881130


##########
common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java:
##########
@@ -341,6 +343,21 @@ public boolean contains(final UTF8String substring) {
     return false;
   }
 
+  public boolean contains(final UTF8String substring, int collationId) throws SparkException {
+    if (CollationFactory.fetchCollation(collationId).isBinaryCollation) {
+      return this.contains(substring);
+    }
+    if (collationId == CollationFactory.LOWERCASE_COLLATION_ID) {
+      return this.toLowerCase().contains(substring.toLowerCase());
+    }
+    // TODO: enable ICU collation support for "contains"

Review Comment:
   done



##########
common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java:
##########
@@ -341,6 +343,21 @@ public boolean contains(final UTF8String substring) {
     return false;
   }
 
+  public boolean contains(final UTF8String substring, int collationId) throws SparkException {
+    if (CollationFactory.fetchCollation(collationId).isBinaryCollation) {
+      return this.contains(substring);
+    }
+    if (collationId == CollationFactory.LOWERCASE_COLLATION_ID) {
+      return this.toLowerCase().contains(substring.toLowerCase());
+    }
+    // TODO: enable ICU collation support for "contains"

Review Comment:
   done [SPARK-47248](https://issues.apache.org/jira/browse/SPARK-47248)





Re: [PR] [WIP] Collation support for built-in string functions: contains, startswith, endswith [spark]

Posted by "uros-db (via GitHub)" <gi...@apache.org>.
uros-db commented on code in PR #45216:
URL: https://github.com/apache/spark/pull/45216#discussion_r1498999850


##########
sql/core/src/test/scala/org/apache/spark/sql/CollationSuite.scala:
##########
@@ -174,4 +174,291 @@ class CollationSuite extends QueryTest with SharedSparkSession {
           Row(expected))
     }
   }
+
+  test("Support contains string expression with Collation") {
+    // Test 'contains' with different collations
+    var listLeft: List[String] = List()
+    var listRight: List[String] = List()
+    var listResult: List[Boolean] = List()
+
+    // UCS_BASIC (default) & UNICODE collation
+    listLeft = List("", "c", "abc", "cde", "abde", "abcde", "C", "ABC", "CDE", "ABDE", "ABCDE")
+    listRight = List("", "c", "abc", "cde", "abde", "abcde", "C", "ABC", "CDE", "ABDE", "ABCDE")
+    listResult = List(
+    //  ""     c     abc    cde   abde  abcde    C     ABC    CDE    ABDE  ABCDE
+      true, false, false, false, false, false, false, false, false, false, false, //  ""
+      true, true, false, false, false, false, false, false, false, false, false,  //   c
+      true, true, true, false, false, false, false, false, false, false, false,   // abc
+      true, true, false, true, false, false, false, false, false, false, false,   //   cde
+      true, false, false, false, true, false, false, false, false, false, false,  // abde
+      true, true, true, true, false, true, false, false, false, false, false,     // abcde
+      true, false, false, false, false, false, true, false, false, false, false,  //   C
+      true, false, false, false, false, false, true, true, false, false, false,   // ABC
+      true, false, false, false, false, false, true, false, true, false, false,   //   CDE
+      true, false, false, false, false, false, false, false, false, true, false,  // ABDE
+      true, false, false, false, false, false, true, true, true, false, true)     // ABCDE
+    for {
+      (left, index_left) <- listLeft.zipWithIndex
+      (right, index_right) <- listRight.zipWithIndex
+    } {
+      val expectedAnswer = listResult(index_left * listRight.length + index_right)
+      // UCS_BASIC (default)
+      checkAnswer(sql("SELECT contains('" + left + "', '" + right + "')"), Row(expectedAnswer))
+      // UCS_BASIC
+      checkAnswer(sql("SELECT contains('" + left + "', collate('" +
+        right + "', 'UCS_BASIC'))"), Row(expectedAnswer))
+      checkAnswer(sql("SELECT contains(collate('" + left + "', 'UCS_BASIC'), collate('" +
+        right + "', 'UCS_BASIC'))"), Row(expectedAnswer))
+      // UNICODE
+      checkAnswer(sql("SELECT contains('" + left + "', collate('" +
+        right + "', 'UNICODE'))"), Row(expectedAnswer))
+      checkAnswer(sql("SELECT contains(collate('" + left + "', 'UNICODE'), collate('" +
+        right + "', 'UNICODE'))"), Row(expectedAnswer))
+    }
+
+
+    // UCS_BASIC_LCASE & UNICODE_CI collation
+    listResult = List(
+    //  ""     c     abc    cde   abde  abcde    C     ABC    CDE    ABDE  ABCDE
+      true, false, false, false, false, false, false, false, false, false, false, //  ""
+      true, true, false, false, false, false, true, false, false, false, false,   //   c
+      true, true, true, false, false, false, true, true, false, false, false,     // abc
+      true, true, false, true, false, false, true, false, true, false, false,     //    cde
+      true, false, false, false, true, false, false, false, false, true, false,   // abde
+      true, true, true, true, false, true, true, true, true, false, true,         // abcde
+      true, true, false, false, false, false, true, false, false, false, false,   //   C
+      true, true, true, false, false, false, true, true, false, false, false,     // ABC
+      true, true, false, true, false, false, true, false, true, false, false,     //   CDE
+      true, false, false, false, true, false, false, false, false, true, false,   // ABDE
+      true, true, true, true, false, true, true, true, true, false, true)         // ABCDE
+    for {
+      (left, index_left) <- listLeft.zipWithIndex
+      (right, index_right) <- listRight.zipWithIndex
+    } {
+      val expectedAnswer = listResult(index_left * listRight.length + index_right)
+      // UCS_BASIC_LCASE
+      checkAnswer(sql("SELECT contains('" + left + "', collate('" +
+        right + "', 'UCS_BASIC_LCASE'))"), Row(expectedAnswer))
+      checkAnswer(sql("SELECT contains(collate('" + left + "', 'UCS_BASIC_LCASE'), collate('" +
+        right + "', 'UCS_BASIC_LCASE'))"), Row(expectedAnswer))
+      // UNICODE_CI
+      checkAnswer(sql("SELECT contains('" + left + "', collate('" +
+        right + "', 'UNICODE_CI'))"), Row(expectedAnswer))
+      checkAnswer(sql("SELECT contains(collate('" + left + "', 'UNICODE_CI'), collate('" +
+        right + "', 'UNICODE_CI'))"), Row(expectedAnswer))
+    }
+  }
+
+    test("Support startsWith string expression with Collation") {
+    // Test 'startsWith' with different collations
+    var listLeft: List[String] = List()
+    var listRight: List[String] = List()
+    var listResult: List[Boolean] = List()
+
+    // UCS_BASIC (default) & UNICODE collation
+    listLeft = List("", "c", "abc", "cde", "abde", "abcde", "C", "ABC", "CDE", "ABDE", "ABCDE")
+    listRight = List("", "c", "abc", "cde", "abde", "abcde", "C", "ABC", "CDE", "ABDE", "ABCDE")
+    listResult = List(
+    //  ""     c     abc    cde   abde  abcde    C     ABC    CDE    ABDE  ABCDE
+      true, false, false, false, false, false, false, false, false, false, false, //  ""
+      true, true, false, false, false, false, false, false, false, false, false,  //   c
+      true, false, true, false, false, false, false, false, false, false, false,  // abc
+      true, true, false, true, false, false, false, false, false, false, false,   //   cde
+      true, false, false, false, true, false, false, false, false, false, false,  // abde
+      true, false, true, false, false, true, false, false, false, false, false,   // abcde
+      true, false, false, false, false, false, true, false, false, false, false,  //   C
+      true, false, false, false, false, false, false, true, false, false, false,  // ABC
+      true, false, false, false, false, false, true, false, true, false, false,   //   CDE
+      true, false, false, false, false, false, false, false, false, true, false,  // ABDE
+      true, false, false, false, false, false, false, true, false, false, true)   // ABCDE
+    for {
+      (left, index_left) <- listLeft.zipWithIndex
+      (right, index_right) <- listRight.zipWithIndex
+    } {
+      val expectedAnswer = listResult(index_left * listRight.length + index_right)
+      // UCS_BASIC (default)
+      checkAnswer(sql("SELECT startswith('" + left + "', '" + right + "')"), Row(expectedAnswer))
+      // UCS_BASIC
+      checkAnswer(sql("SELECT startswith('" + left + "', collate('" +
+        right + "', 'UCS_BASIC'))"), Row(expectedAnswer))
+      checkAnswer(sql("SELECT startswith(collate('" + left + "', 'UCS_BASIC'), collate('" +
+        right + "', 'UCS_BASIC'))"), Row(expectedAnswer))
+      // UNICODE
+      checkAnswer(sql("SELECT startswith('" + left + "', collate('" +
+        right + "', 'UNICODE'))"), Row(expectedAnswer))
+      checkAnswer(sql("SELECT startswith(collate('" + left + "', 'UNICODE'), collate('" +
+        right + "', 'UNICODE'))"), Row(expectedAnswer))
+    }
+
+    // UCS_BASIC_LCASE & UNICODE_CI collation
+    listResult = List(
+    //  ""     c     abc    cde   abde  abcde    C     ABC    CDE    ABDE  ABCDE
+      true, false, false, false, false, false, false, false, false, false, false, //  ""
+      true, true, false, false, false, false, true, false, false, false, false,   //   c
+      true, false, true, false, false, false, false, true, false, false, false,   // abc
+      true, true, false, true, false, false, true, false, true, false, false,     //   cde
+      true, false, false, false, true, false, false, false, false, true, false,   // abde
+      true, false, true, false, false, true, false, true, false, false, true,     // abcde
+      true, true, false, false, false, false, true, false, false, false, false,   //   C
+      true, false, true, false, false, false, false, true, false, false, false,   // ABC
+      true, true, false, true, false, false, true, false, true, false, false,     //   CDE
+      true, false, false, false, true, false, false, false, false, true, false,   // ABDE
+      true, false, true, false, false, true, false, true, false, false, true)     // ABCDE
+    for {
+      (left, index_left) <- listLeft.zipWithIndex
+      (right, index_right) <- listRight.zipWithIndex
+    } {
+      val expectedAnswer = listResult(index_left * listRight.length + index_right)
+      // UCS_BASIC_LCASE
+      checkAnswer(sql("SELECT startswith('" + left + "', collate('" +
+        right + "', 'UCS_BASIC_LCASE'))"), Row(expectedAnswer))
+      checkAnswer(sql("SELECT startswith(collate('" + left + "', 'UCS_BASIC_LCASE'), collate('" +
+        right + "', 'UCS_BASIC_LCASE'))"), Row(expectedAnswer))
+      // UNICODE_CI
+      checkAnswer(sql("SELECT startswith('" + left + "', collate('" +
+        right + "', 'UNICODE_CI'))"), Row(expectedAnswer))
+      checkAnswer(sql("SELECT startswith(collate('" + left + "', 'UNICODE_CI'), collate('" +
+        right + "', 'UNICODE_CI'))"), Row(expectedAnswer))
+    }
+
+    // Serbian language collation tests

Review Comment:
   Of course! Removed in new commit





Re: [PR] [SPARK-47131][SQL] Collations - support for built-in string functions: contains, startswith, endswith [spark]

Posted by "mitkedb (via GitHub)" <gi...@apache.org>.
mitkedb commented on code in PR #45216:
URL: https://github.com/apache/spark/pull/45216#discussion_r1499334642


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala:
##########
@@ -586,9 +584,17 @@ object ContainsExpressionBuilder extends StringBinaryPredicateExpressionBuilderB
 }
 
 case class Contains(left: Expression, right: Expression) extends StringPredicate {
-  override def compare(l: UTF8String, r: UTF8String): Boolean = l.contains(r)
+  override def compare(l: UTF8String, r: UTF8String): Boolean = {
+    val collationID = right.dataType.asInstanceOf[StringType].collationId

Review Comment:
   Maybe we can have someone (e.g. @nikolamand-db ) build a java micro-benchmark in parallel to the UDF impl to assess the performance of the Java/JITed code, and we make decisions based on that?
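   
   Not part of this PR, but as a rough sketch of what such a measurement could look like (a plain timing loop over UTF8String; for real numbers JMH or Spark's benchmark utilities would be more appropriate, and the inputs below are made up):
   
   ```scala
   import org.apache.spark.unsafe.types.UTF8String
   
   def time(label: String)(body: => Unit): Unit = {
     val start = System.nanoTime()
     body
     println(s"$label: ${(System.nanoTime() - start) / 1e6} ms")
   }
   
   val haystack = UTF8String.fromString("abcdefghij" * 10)
   val needle = UTF8String.fromString("fghi")
   val iterations = 10000000
   
   // Baseline: the pre-collation binary contains.
   time("binary contains") {
     var i = 0
     while (i < iterations) { haystack.contains(needle); i += 1 }
   }
   
   // Collation-aware overload (1 is the lowercase collation id in this PR).
   time("contains with collationId") {
     var i = 0
     while (i < iterations) { haystack.contains(needle, 1); i += 1 }
   }
   ```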





Re: [PR] [SPARK-47131][SQL][COLLATION] String function support: contains, startswith, endswith [spark]

Posted by "uros-db (via GitHub)" <gi...@apache.org>.
uros-db commented on code in PR #45216:
URL: https://github.com/apache/spark/pull/45216#discussion_r1502792597


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala:
##########
@@ -559,6 +557,19 @@ case class BinaryPredicate(override val prettyName: String, left: Expression, ri
   }
 }
 
+trait CollationCheck {
+  self: BinaryExpression =>
+  def checkCollationIds(): Int = {
+    val leftCollationId: Int = left.dataType.asInstanceOf[StringType].collationId
+    val rightCollationId: Int = right.dataType.asInstanceOf[StringType].collationId
+    if (leftCollationId != rightCollationId) {
+      throw new IllegalArgumentException(
+        "Function requires the same collation type for left and right strings.")

Review Comment:
   On it!





Re: [PR] [SPARK-47131][SQL][COLLATION] String function support: contains, startswith, endswith [spark]

Posted by "dbatomic (via GitHub)" <gi...@apache.org>.
dbatomic commented on code in PR #45216:
URL: https://github.com/apache/spark/pull/45216#discussion_r1502777283


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala:
##########
@@ -559,6 +557,19 @@ case class BinaryPredicate(override val prettyName: String, left: Expression, ri
   }
 }
 
+trait CollationCheck {
+  self: BinaryExpression =>

Review Comment:
   Btw, good job adding this trait. We do need this!





Re: [PR] [SPARK-47131][SQL][COLLATION] String function support: contains, startswith, endswith [spark]

Posted by "uros-db (via GitHub)" <gi...@apache.org>.
uros-db commented on code in PR #45216:
URL: https://github.com/apache/spark/pull/45216#discussion_r1505450164


##########
common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java:
##########
@@ -341,6 +342,21 @@ public boolean contains(final UTF8String substring) {
     return false;
   }
 
+  public boolean contains(final UTF8String substring, int collationId) throws SparkException {
+    if (CollationFactory.fetchCollation(collationId).isBinaryCollation) {
+      return this.containsBinary(substring);
+    }
+    if (collationId == CollationFactory.LOWERCASE_COLLATION_ID) {
+      return this.toLowerCase().containsBinary(substring.toLowerCase());
+    }
+    // TODO: enable ICU collation support for "contains"
+    Map<String, String> params = new HashMap<>();
+    params.put("functionName", Thread.currentThread().getStackTrace()[1].getMethodName());
+    params.put("collationName", CollationFactory.fetchCollation(collationId).collationName);
+    throw new SparkException("COLLATION_NOT_SUPPORTED_FOR_FUNCTION",

Review Comment:
   yes, will implement "contains" for non-binary non-lowercase collations in a future PR, as it's a bit more involved at the moment
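   
   For context, one possible shape for that follow-up, sketched here as standalone Scala on top of ICU4J's `StringSearch` rather than as the eventual UTF8String.java change (how the collator would be obtained from CollationFactory is intentionally left out):
   
   ```scala
   import java.text.StringCharacterIterator
   import com.ibm.icu.text.{RuleBasedCollator, SearchIterator, StringSearch}
   
   // Collation-aware substring search: true if `pattern` occurs in `target`
   // under the matching rules of the given collator.
   def collationAwareContains(target: String, pattern: String, collator: RuleBasedCollator): Boolean = {
     if (pattern.isEmpty) return true
     if (target.isEmpty) return false
     val search = new StringSearch(pattern, new StringCharacterIterator(target), collator)
     search.first() != SearchIterator.DONE
   }
   ```
   
   A case-insensitive collator for trying this out could come from `Collator.getInstance(ULocale.ROOT)` with strength set to `Collator.SECONDARY`; the real wiring belongs to SPARK-47248.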





Re: [PR] [SPARK-47131][SQL][COLLATION] String function support: contains, startswith, endswith [spark]

Posted by "dbatomic (via GitHub)" <gi...@apache.org>.
dbatomic commented on code in PR #45216:
URL: https://github.com/apache/spark/pull/45216#discussion_r1506039038


##########
common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java:
##########
@@ -341,6 +342,21 @@ public boolean contains(final UTF8String substring) {
     return false;
   }
 
+  public boolean contains(final UTF8String substring, int collationId) throws SparkException {
+    if (CollationFactory.fetchCollation(collationId).isBinaryCollation) {
+      return this.contains(substring);
+    }
+    if (collationId == CollationFactory.LOWERCASE_COLLATION_ID) {
+      return this.toLowerCase().contains(substring.toLowerCase());
+    }
+    // TODO: enable ICU collation support for "contains"
+    Map<String, String> params = new HashMap<>();
+    params.put("functionName", Thread.currentThread().getStackTrace()[1].getMethodName());

Review Comment:
   I would prefer simply passing `contains` instead of pulling the function name from the stack :)





Re: [PR] [SPARK-47131][SQL][COLLATION] String function support: contains, startswith, endswith [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on code in PR #45216:
URL: https://github.com/apache/spark/pull/45216#discussion_r1506054121


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala:
##########
@@ -585,10 +604,23 @@ object ContainsExpressionBuilder extends StringBinaryPredicateExpressionBuilderB
   }
 }
 
-case class Contains(left: Expression, right: Expression) extends StringPredicate {
-  override def compare(l: UTF8String, r: UTF8String): Boolean = l.contains(r)
+case class Contains(left: Expression, right: Expression) extends StringPredicate
+  with CollationCompatibilityCheck {
+  override def compare(l: UTF8String, r: UTF8String): Boolean = {
+    val collationId = checkCollationIds()

Review Comment:
   collation id should be checked only once during analysis, not per row. Please check `Expression.checkInputDataTypes`
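   
   A minimal sketch of moving the check to analysis time, in the spirit of the `checkInputDataTypes` override quoted earlier in this thread (the error reporting below is illustrative, not necessarily the exact error class the PR ends up with):
   
   ```scala
   import org.apache.spark.sql.catalyst.analysis.TypeCheckResult
   import org.apache.spark.sql.catalyst.analysis.TypeCheckResult.DataTypeMismatch
   import org.apache.spark.sql.catalyst.util.CollationFactory
   import org.apache.spark.sql.types.StringType
   
   trait CollationCompatibilityCheck extends BinaryExpression {
     // Runs once, when the plan is analyzed, instead of once per row.
     override def checkInputDataTypes(): TypeCheckResult = {
       val defaultCheck = super.checkInputDataTypes()
       if (defaultCheck.isFailure) {
         defaultCheck
       } else {
         val leftId = left.dataType.asInstanceOf[StringType].collationId
         val rightId = right.dataType.asInstanceOf[StringType].collationId
         if (leftId == rightId) {
           TypeCheckResult.TypeCheckSuccess
         } else {
           DataTypeMismatch(
             errorSubClass = "COLLATION_MISMATCH",
             messageParameters = Map(
               "collationNameLeft" -> CollationFactory.fetchCollation(leftId).collationName,
               "collationNameRight" -> CollationFactory.fetchCollation(rightId).collationName))
         }
       }
     }
   }
   ```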





Re: [PR] [SPARK-47131][SQL][COLLATION] String function support: contains, startswith, endswith [spark]

Posted by "uros-db (via GitHub)" <gi...@apache.org>.
uros-db commented on code in PR #45216:
URL: https://github.com/apache/spark/pull/45216#discussion_r1508674676


##########
common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java:
##########
@@ -355,14 +371,43 @@ public boolean matchAt(final UTF8String s, int pos) {
     return ByteArrayMethods.arrayEquals(base, offset + pos, s.base, s.offset, s.numBytes);
   }
 
+  private boolean matchAt(final UTF8String s, int pos, int collationId) {
+    if (s.numBytes + pos > numBytes || pos < 0) {
+      return false;
+    }
+    return this.substring(pos, pos + s.numBytes).semanticCompare(s, collationId) == 0;
+  }
+
   public boolean startsWith(final UTF8String prefix) {
     return matchAt(prefix, 0);
   }
 
+  public boolean startsWith(final UTF8String prefix, int collationId) {
+    if (CollationFactory.fetchCollation(collationId).isBinaryCollation) {
+      return this.startsWith(prefix);
+    }
+    if (collationId == CollationFactory.LOWERCASE_COLLATION_ID) {
+      return this.toLowerCase().startsWith(prefix.toLowerCase());
+    }
+    // ICU collation support

Review Comment:
   no, this one is implemented
   if the comment is confusing, I could delete it?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47131][SQL][COLLATION] String function support: contains, startswith, endswith [spark]

Posted by "uros-db (via GitHub)" <gi...@apache.org>.
uros-db commented on code in PR #45216:
URL: https://github.com/apache/spark/pull/45216#discussion_r1508674676


##########
common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java:
##########
@@ -355,14 +371,43 @@ public boolean matchAt(final UTF8String s, int pos) {
     return ByteArrayMethods.arrayEquals(base, offset + pos, s.base, s.offset, s.numBytes);
   }
 
+  private boolean matchAt(final UTF8String s, int pos, int collationId) {
+    if (s.numBytes + pos > numBytes || pos < 0) {
+      return false;
+    }
+    return this.substring(pos, pos + s.numBytes).semanticCompare(s, collationId) == 0;
+  }
+
   public boolean startsWith(final UTF8String prefix) {
     return matchAt(prefix, 0);
   }
 
+  public boolean startsWith(final UTF8String prefix, int collationId) {
+    if (CollationFactory.fetchCollation(collationId).isBinaryCollation) {
+      return this.startsWith(prefix);
+    }
+    if (collationId == CollationFactory.LOWERCASE_COLLATION_ID) {
+      return this.toLowerCase().startsWith(prefix.toLowerCase());
+    }
+    // ICU collation support

Review Comment:
   no, this one is implemented



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47131][SQL][COLLATION] String function support: contains, startswith, endswith [spark]

Posted by "uros-db (via GitHub)" <gi...@apache.org>.
uros-db commented on code in PR #45216:
URL: https://github.com/apache/spark/pull/45216#discussion_r1508872841


##########
common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java:
##########
@@ -355,14 +371,43 @@ public boolean matchAt(final UTF8String s, int pos) {
     return ByteArrayMethods.arrayEquals(base, offset + pos, s.base, s.offset, s.numBytes);
   }
 
+  private boolean matchAt(final UTF8String s, int pos, int collationId) {
+    if (s.numBytes + pos > numBytes || pos < 0) {
+      return false;
+    }
+    return this.substring(pos, pos + s.numBytes).semanticCompare(s, collationId) == 0;
+  }
+
   public boolean startsWith(final UTF8String prefix) {
     return matchAt(prefix, 0);
   }
 
+  public boolean startsWith(final UTF8String prefix, int collationId) {
+    if (CollationFactory.fetchCollation(collationId).isBinaryCollation) {
+      return this.startsWith(prefix);
+    }
+    if (collationId == CollationFactory.LOWERCASE_COLLATION_ID) {
+      return this.toLowerCase().startsWith(prefix.toLowerCase());
+    }
+    // ICU collation support

Review Comment:
   Since passing collationId to `matchAt` already implies a collation-aware implementation, which then uses `semanticCompare` with the corresponding ICU comparator, I guess the comment makes no sense anyway. Removing it.
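   
   To make that concrete, here is the dispatch re-expressed as a standalone Scala helper, for illustration only; the real implementation is the Java overload on UTF8String shown in the diff, and `semanticCompare` is the collation-aware comparison this PR introduces:
   
   ```scala
   import org.apache.spark.sql.catalyst.util.CollationFactory
   import org.apache.spark.unsafe.types.UTF8String

   object CollationAwareMatch {
     def startsWith(str: UTF8String, prefix: UTF8String, collationId: Int): Boolean = {
       if (CollationFactory.fetchCollation(collationId).isBinaryCollation) {
         str.startsWith(prefix)                         // binary collations: byte-for-byte prefix check
       } else if (collationId == CollationFactory.LOWERCASE_COLLATION_ID) {
         str.toLowerCase.startsWith(prefix.toLowerCase) // lowercase collation: compare lowercased strings
       } else if (prefix.numBytes > str.numBytes) {
         false
       } else {
         // ICU-backed collations: compare the leading slice under the collation's comparator.
         str.substring(0, prefix.numBytes).semanticCompare(prefix, collationId) == 0
       }
     }
   }
   ```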



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47131][SQL][COLLATION] String function support: contains, startswith, endswith [spark]

Posted by "uros-db (via GitHub)" <gi...@apache.org>.
uros-db commented on code in PR #45216:
URL: https://github.com/apache/spark/pull/45216#discussion_r1508881751


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala:
##########
@@ -586,9 +608,21 @@ object ContainsExpressionBuilder extends StringBinaryPredicateExpressionBuilderB
 }
 
 case class Contains(left: Expression, right: Expression) extends StringPredicate {
-  override def compare(l: UTF8String, r: UTF8String): Boolean = l.contains(r)
+  override def compare(l: UTF8String, r: UTF8String): Boolean = {
+    val collationId: Int = left.dataType.asInstanceOf[StringType].collationId

Review Comment:
   makes sense, avoid duplicate eval



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47131][SQL][COLLATION] String function support: contains, startswith, endswith [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on code in PR #45216:
URL: https://github.com/apache/spark/pull/45216#discussion_r1508922010


##########
sql/core/src/test/scala/org/apache/spark/sql/CollationSuite.scala:
##########
@@ -183,6 +183,266 @@ class CollationSuite extends DatasourceV2SQLBase {
     }
   }
 
+  test("checkCollation throws exception for incompatible collationIds") {
+    val left: String = "abc" // collate with 'UNICODE_CI'
+    val leftCollationName: String = "UNICODE_CI";
+    var right: String = null // collate with 'UNICODE'
+    val rightCollationName: String = "UNICODE";
+    // contains
+    right = left.substring(1, 2);
+    checkError(
+      exception = intercept[SparkException] {
+        spark.sql(s"SELECT contains(collate('$left', '$leftCollationName')," +
+          s"collate('$right', '$rightCollationName'))").collect()

Review Comment:
   ```suggestion
             s"collate('$right', '$rightCollationName'))")
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47131][SQL][COLLATION] String function support: contains, startswith, endswith [spark]

Posted by "uros-db (via GitHub)" <gi...@apache.org>.
uros-db commented on code in PR #45216:
URL: https://github.com/apache/spark/pull/45216#discussion_r1508983490


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala:
##########
@@ -497,10 +497,33 @@ case class Lower(child: Expression)
 abstract class StringPredicate extends BinaryExpression
   with Predicate with ImplicitCastInputTypes with NullIntolerant {
 
+  final val collationId: Int = left.dataType.asInstanceOf[StringType].collationId
+
   def compare(l: UTF8String, r: UTF8String): Boolean
 
   override def inputTypes: Seq[DataType] = Seq(StringType, StringType)
 
+  override def checkInputDataTypes(): TypeCheckResult = {
+    val checkResult = super.checkInputDataTypes()
+    if (checkResult.isFailure) {
+      return checkResult
+    }
+    // Additional check needed for collation compatibility
+    val rightCollationId: Int = right.dataType.asInstanceOf[StringType].collationId
+    if (collationId != rightCollationId) {
+      throw new SparkException(

Review Comment:
   thanks, fixed!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [WIP] Collation support for built-in string functions: contains, startswith, endswith [spark]

Posted by "uros-db (via GitHub)" <gi...@apache.org>.
uros-db commented on code in PR #45216:
URL: https://github.com/apache/spark/pull/45216#discussion_r1499001199


##########
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationFactory.java:
##########
@@ -66,6 +66,13 @@ public static class Collation {
      */
     public final boolean isBinaryCollation;
 
+    /**
+     * Built-in string functions with collation support.

Review Comment:
   Sounds good, on it!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [WIP] Collation support for built-in string functions: contains, startswith, endswith [spark]

Posted by "dbatomic (via GitHub)" <gi...@apache.org>.
dbatomic commented on code in PR #45216:
URL: https://github.com/apache/spark/pull/45216#discussion_r1498991374


##########
sql/core/src/test/scala/org/apache/spark/sql/CollationSuite.scala:
##########
@@ -174,4 +174,291 @@ class CollationSuite extends QueryTest with SharedSparkSession {
           Row(expected))
     }
   }
+
+  test("Support contains string expression with Collation") {
+    // Test 'contains' with different collations
+    var listLeft: List[String] = List()
+    var listRight: List[String] = List()
+    var listResult: List[Boolean] = List()
+
+    // UCS_BASIC (default) & UNICODE collation
+    listLeft = List("", "c", "abc", "cde", "abde", "abcde", "C", "ABC", "CDE", "ABDE", "ABCDE")
+    listRight = List("", "c", "abc", "cde", "abde", "abcde", "C", "ABC", "CDE", "ABDE", "ABCDE")
+    listResult = List(
+    //  ""     c     abc    cde   abde  abcde    C     ABC    CDE    ABDE  ABCDE
+      true, false, false, false, false, false, false, false, false, false, false, //  ""
+      true, true, false, false, false, false, false, false, false, false, false,  //   c
+      true, true, true, false, false, false, false, false, false, false, false,   // abc
+      true, true, false, true, false, false, false, false, false, false, false,   //   cde
+      true, false, false, false, true, false, false, false, false, false, false,  // abde
+      true, true, true, true, false, true, false, false, false, false, false,     // abcde
+      true, false, false, false, false, false, true, false, false, false, false,  //   C
+      true, false, false, false, false, false, true, true, false, false, false,   // ABC
+      true, false, false, false, false, false, true, false, true, false, false,   //   CDE
+      true, false, false, false, false, false, false, false, false, true, false,  // ABDE
+      true, false, false, false, false, false, true, true, true, false, true)     // ABCDE
+    for {
+      (left, index_left) <- listLeft.zipWithIndex
+      (right, index_right) <- listRight.zipWithIndex
+    } {
+      val expectedAnswer = listResult(index_left * listRight.length + index_right)
+      // UCS_BASIC (default)
+      checkAnswer(sql("SELECT contains('" + left + "', '" + right + "')"), Row(expectedAnswer))
+      // UCS_BASIC
+      checkAnswer(sql("SELECT contains('" + left + "', collate('" +
+        right + "', 'UCS_BASIC'))"), Row(expectedAnswer))
+      checkAnswer(sql("SELECT contains(collate('" + left + "', 'UCS_BASIC'), collate('" +
+        right + "', 'UCS_BASIC'))"), Row(expectedAnswer))
+      // UNICODE
+      checkAnswer(sql("SELECT contains('" + left + "', collate('" +
+        right + "', 'UNICODE'))"), Row(expectedAnswer))
+      checkAnswer(sql("SELECT contains(collate('" + left + "', 'UNICODE'), collate('" +
+        right + "', 'UNICODE'))"), Row(expectedAnswer))
+    }
+
+
+    // UCS_BASIC_LCASE & UNICODE_CI collation
+    listResult = List(
+    //  ""     c     abc    cde   abde  abcde    C     ABC    CDE    ABDE  ABCDE
+      true, false, false, false, false, false, false, false, false, false, false, //  ""
+      true, true, false, false, false, false, true, false, false, false, false,   //   c
+      true, true, true, false, false, false, true, true, false, false, false,     // abc
+      true, true, false, true, false, false, true, false, true, false, false,     //    cde
+      true, false, false, false, true, false, false, false, false, true, false,   // abde
+      true, true, true, true, false, true, true, true, true, false, true,         // abcde
+      true, true, false, false, false, false, true, false, false, false, false,   //   C
+      true, true, true, false, false, false, true, true, false, false, false,     // ABC
+      true, true, false, true, false, false, true, false, true, false, false,     //   CDE
+      true, false, false, false, true, false, false, false, false, true, false,   // ABDE
+      true, true, true, true, false, true, true, true, true, false, true)         // ABCDE
+    for {
+      (left, index_left) <- listLeft.zipWithIndex
+      (right, index_right) <- listRight.zipWithIndex
+    } {
+      val expectedAnswer = listResult(index_left * listRight.length + index_right)
+      // UCS_BASIC_LCASE
+      checkAnswer(sql("SELECT contains('" + left + "', collate('" +
+        right + "', 'UCS_BASIC_LCASE'))"), Row(expectedAnswer))
+      checkAnswer(sql("SELECT contains(collate('" + left + "', 'UCS_BASIC_LCASE'), collate('" +
+        right + "', 'UCS_BASIC_LCASE'))"), Row(expectedAnswer))
+      // UNICODE_CI
+      checkAnswer(sql("SELECT contains('" + left + "', collate('" +
+        right + "', 'UNICODE_CI'))"), Row(expectedAnswer))
+      checkAnswer(sql("SELECT contains(collate('" + left + "', 'UNICODE_CI'), collate('" +
+        right + "', 'UNICODE_CI'))"), Row(expectedAnswer))
+    }
+  }
+
+    test("Support startsWith string expression with Collation") {
+    // Test 'startsWith' with different collations
+    var listLeft: List[String] = List()
+    var listRight: List[String] = List()
+    var listResult: List[Boolean] = List()
+
+    // UCS_BASIC (default) & UNICODE collation
+    listLeft = List("", "c", "abc", "cde", "abde", "abcde", "C", "ABC", "CDE", "ABDE", "ABCDE")
+    listRight = List("", "c", "abc", "cde", "abde", "abcde", "C", "ABC", "CDE", "ABDE", "ABCDE")
+    listResult = List(
+    //  ""     c     abc    cde   abde  abcde    C     ABC    CDE    ABDE  ABCDE
+      true, false, false, false, false, false, false, false, false, false, false, //  ""
+      true, true, false, false, false, false, false, false, false, false, false,  //   c
+      true, false, true, false, false, false, false, false, false, false, false,  // abc
+      true, true, false, true, false, false, false, false, false, false, false,   //   cde
+      true, false, false, false, true, false, false, false, false, false, false,  // abde
+      true, false, true, false, false, true, false, false, false, false, false,   // abcde
+      true, false, false, false, false, false, true, false, false, false, false,  //   C
+      true, false, false, false, false, false, false, true, false, false, false,  // ABC
+      true, false, false, false, false, false, true, false, true, false, false,   //   CDE
+      true, false, false, false, false, false, false, false, false, true, false,  // ABDE
+      true, false, false, false, false, false, false, true, false, false, true)   // ABCDE
+    for {
+      (left, index_left) <- listLeft.zipWithIndex
+      (right, index_right) <- listRight.zipWithIndex
+    } {
+      val expectedAnswer = listResult(index_left * listRight.length + index_right)
+      // UCS_BASIC (default)
+      checkAnswer(sql("SELECT startswith('" + left + "', '" + right + "')"), Row(expectedAnswer))
+      // UCS_BASIC
+      checkAnswer(sql("SELECT startswith('" + left + "', collate('" +
+        right + "', 'UCS_BASIC'))"), Row(expectedAnswer))
+      checkAnswer(sql("SELECT startswith(collate('" + left + "', 'UCS_BASIC'), collate('" +
+        right + "', 'UCS_BASIC'))"), Row(expectedAnswer))
+      // UNICODE
+      checkAnswer(sql("SELECT startswith('" + left + "', collate('" +
+        right + "', 'UNICODE'))"), Row(expectedAnswer))
+      checkAnswer(sql("SELECT startswith(collate('" + left + "', 'UNICODE'), collate('" +
+        right + "', 'UNICODE'))"), Row(expectedAnswer))
+    }
+
+    // UCS_BASIC_LCASE & UNICODE_CI collation
+    listResult = List(
+    //  ""     c     abc    cde   abde  abcde    C     ABC    CDE    ABDE  ABCDE
+      true, false, false, false, false, false, false, false, false, false, false, //  ""
+      true, true, false, false, false, false, true, false, false, false, false,   //   c
+      true, false, true, false, false, false, false, true, false, false, false,   // abc
+      true, true, false, true, false, false, true, false, true, false, false,     //   cde
+      true, false, false, false, true, false, false, false, false, true, false,   // abde
+      true, false, true, false, false, true, false, true, false, false, true,     // abcde
+      true, true, false, false, false, false, true, false, false, false, false,   //   C
+      true, false, true, false, false, false, false, true, false, false, false,   // ABC
+      true, true, false, true, false, false, true, false, true, false, false,     //   CDE
+      true, false, false, false, true, false, false, false, false, true, false,   // ABDE
+      true, false, true, false, false, true, false, true, false, false, true)     // ABCDE
+    for {
+      (left, index_left) <- listLeft.zipWithIndex
+      (right, index_right) <- listRight.zipWithIndex
+    } {
+      val expectedAnswer = listResult(index_left * listRight.length + index_right)
+      // UCS_BASIC_LCASE
+      checkAnswer(sql("SELECT startswith('" + left + "', collate('" +
+        right + "', 'UCS_BASIC_LCASE'))"), Row(expectedAnswer))
+      checkAnswer(sql("SELECT startswith(collate('" + left + "', 'UCS_BASIC_LCASE'), collate('" +
+        right + "', 'UCS_BASIC_LCASE'))"), Row(expectedAnswer))
+      // UNICODE_CI
+      checkAnswer(sql("SELECT startswith('" + left + "', collate('" +
+        right + "', 'UNICODE_CI'))"), Row(expectedAnswer))
+      checkAnswer(sql("SELECT startswith(collate('" + left + "', 'UNICODE_CI'), collate('" +
+        right + "', 'UNICODE_CI'))"), Row(expectedAnswer))
+    }
+
+    // Serbian language collation tests

Review Comment:
   Guess that this can be removed :)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47131][SQL][COLLATION] String function support: contains, startswith, endswith [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on code in PR #45216:
URL: https://github.com/apache/spark/pull/45216#discussion_r1510831636


##########
sql/core/src/test/scala/org/apache/spark/sql/CollationSuite.scala:
##########
@@ -234,6 +438,7 @@ class CollationSuite extends DatasourceV2SQLBase with AdaptiveSparkPlanHelper {
         }.nonEmpty)
       }
     }
+

Review Comment:
   let's avoid unnecessary changes.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47131][SQL][COLLATION] String function support: contains, startswith, endswith [spark]

Posted by "uros-db (via GitHub)" <gi...@apache.org>.
uros-db commented on code in PR #45216:
URL: https://github.com/apache/spark/pull/45216#discussion_r1505471123


##########
common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java:
##########
@@ -341,6 +342,21 @@ public boolean contains(final UTF8String substring) {
     return false;
   }
 
+  public boolean contains(final UTF8String substring, int collationId) throws SparkException {
+    if (CollationFactory.fetchCollation(collationId).isBinaryCollation) {
+      return this.containsBinary(substring);
+    }
+    if (collationId == CollationFactory.LOWERCASE_COLLATION_ID) {
+      return this.toLowerCase().containsBinary(substring.toLowerCase());
+    }
+    // TODO: enable ICU collation support for "contains"
+    Map<String, String> params = new HashMap<>();
+    params.put("functionName", Thread.currentThread().getStackTrace()[1].getMethodName());
+    params.put("collationName", CollationFactory.fetchCollation(collationId).collationName);
+    throw new SparkException("COLLATION_NOT_SUPPORTED_FOR_FUNCTION",

Review Comment:
   I think the logic here was that some functions will not be supported (yet) for particular collation types - some may be added in the near future, while others may not - so we will eventually need a way to let users know. That's why we added the `COLLATION_NOT_SUPPORTED_FOR_FUNCTION` error class: while it will likely be removed from here, it will appear in other places instead.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47131][SQL][COLLATION] String function support: contains, startswith, endswith [spark]

Posted by "uros-db (via GitHub)" <gi...@apache.org>.
uros-db commented on code in PR #45216:
URL: https://github.com/apache/spark/pull/45216#discussion_r1509076371


##########
sql/core/src/test/scala/org/apache/spark/sql/CollationSuite.scala:
##########
@@ -183,6 +183,266 @@ class CollationSuite extends DatasourceV2SQLBase {
     }
   }
 
+  test("checkCollation throws exception for incompatible collationIds") {
+    val left: String = "abc" // collate with 'UNICODE_CI'
+    val leftCollationName: String = "UNICODE_CI";
+    var right: String = null // collate with 'UNICODE'
+    val rightCollationName: String = "UNICODE";
+    // contains
+    right = left.substring(1, 2);
+    checkError(
+      exception = intercept[SparkException] {
+        spark.sql(s"SELECT contains(collate('$left', '$leftCollationName')," +
+          s"collate('$right', '$rightCollationName'))").collect()
+      },
+      errorClass = "COLLATION_MISMATCH",
+      sqlState = "42K09",
+      parameters = Map(
+        "collationNameLeft" -> s"$leftCollationName",
+        "collationNameRight" -> s"$rightCollationName"
+      )
+    )
+    // startsWith
+    right = left.substring(0, 1);
+    checkError(
+      exception = intercept[SparkException] {
+        spark.sql(s"SELECT startsWith(collate('$left', '$leftCollationName')," +
+          s"collate('$right', '$rightCollationName'))").collect()
+      },
+      errorClass = "COLLATION_MISMATCH",
+      sqlState = "42K09",
+      parameters = Map(
+        "collationNameLeft" -> s"$leftCollationName",
+        "collationNameRight" -> s"$rightCollationName"
+      )
+    )
+    // endsWith
+    right = left.substring(2, 3);
+    checkError(
+      exception = intercept[SparkException] {
+        spark.sql(s"SELECT endsWith(collate('$left', '$leftCollationName')," +
+          s"collate('$right', '$rightCollationName'))").collect()
+      },
+      errorClass = "COLLATION_MISMATCH",
+      sqlState = "42K09",
+      parameters = Map(
+        "collationNameLeft" -> s"$leftCollationName",
+        "collationNameRight" -> s"$rightCollationName"
+      )
+    )
+  }
+
+  test("Support contains string expression with Collation") {
+    // Test 'contains' with different collations
+    var listLeft: List[String] = List()
+    var listRight: List[String] = List()
+    var listResult: List[Boolean] = List()
+
+    // UCS_BASIC (default) & UNICODE collation
+    listLeft = List("", "c", "abc", "cde", "abde", "abcde", "C", "ABC", "CDE", "ABDE", "ABCDE")
+    listRight = List("", "c", "abc", "cde", "abde", "abcde", "C", "ABC", "CDE", "ABDE", "ABCDE")
+    listResult = List(
+    //  ""     c     abc    cde   abde  abcde    C     ABC    CDE    ABDE  ABCDE
+      true, false, false, false, false, false, false, false, false, false, false, //  ""
+      true, true, false, false, false, false, false, false, false, false, false,  //   c
+      true, true, true, false, false, false, false, false, false, false, false,   // abc
+      true, true, false, true, false, false, false, false, false, false, false,   //   cde
+      true, false, false, false, true, false, false, false, false, false, false,  // abde
+      true, true, true, true, false, true, false, false, false, false, false,     // abcde
+      true, false, false, false, false, false, true, false, false, false, false,  //   C
+      true, false, false, false, false, false, true, true, false, false, false,   // ABC
+      true, false, false, false, false, false, true, false, true, false, false,   //   CDE
+      true, false, false, false, false, false, false, false, false, true, false,  // ABDE
+      true, false, false, false, false, false, true, true, true, false, true)     // ABCDE

Review Comment:
   While it may seem a bit unusual at first, I think this matrix approach covers a broad spectrum of test cases and generally works really well for this set of functions - it covers various edge cases and different collation types (this was especially useful when debugging and experimenting with new collations), e.g. imagine throwing Serbian (ć, Ć) or German (ä, Ä) collations into the mix alongside the other possible `abc`s.
   
   When I first wrote this as a standard linear set of tests, it was much harder to see how and why these functions behave the way they do under different collations, whereas a closer look at this nicely aligned matrix gives a pretty clear overview, all in one place (in addition, it was previously easy to miss a case, and hard-coding the expected results was extra tedious).



##########
sql/core/src/test/scala/org/apache/spark/sql/CollationSuite.scala:
##########
@@ -183,6 +183,266 @@ class CollationSuite extends DatasourceV2SQLBase {
     }
   }
 
+  test("checkCollation throws exception for incompatible collationIds") {
+    val left: String = "abc" // collate with 'UNICODE_CI'
+    val leftCollationName: String = "UNICODE_CI";
+    var right: String = null // collate with 'UNICODE'
+    val rightCollationName: String = "UNICODE";
+    // contains
+    right = left.substring(1, 2);
+    checkError(
+      exception = intercept[SparkException] {
+        spark.sql(s"SELECT contains(collate('$left', '$leftCollationName')," +
+          s"collate('$right', '$rightCollationName'))").collect()
+      },
+      errorClass = "COLLATION_MISMATCH",
+      sqlState = "42K09",
+      parameters = Map(
+        "collationNameLeft" -> s"$leftCollationName",
+        "collationNameRight" -> s"$rightCollationName"
+      )
+    )
+    // startsWith
+    right = left.substring(0, 1);
+    checkError(
+      exception = intercept[SparkException] {
+        spark.sql(s"SELECT startsWith(collate('$left', '$leftCollationName')," +
+          s"collate('$right', '$rightCollationName'))").collect()
+      },
+      errorClass = "COLLATION_MISMATCH",
+      sqlState = "42K09",
+      parameters = Map(
+        "collationNameLeft" -> s"$leftCollationName",
+        "collationNameRight" -> s"$rightCollationName"
+      )
+    )
+    // endsWith
+    right = left.substring(2, 3);
+    checkError(
+      exception = intercept[SparkException] {
+        spark.sql(s"SELECT endsWith(collate('$left', '$leftCollationName')," +
+          s"collate('$right', '$rightCollationName'))").collect()
+      },
+      errorClass = "COLLATION_MISMATCH",
+      sqlState = "42K09",
+      parameters = Map(
+        "collationNameLeft" -> s"$leftCollationName",
+        "collationNameRight" -> s"$rightCollationName"
+      )
+    )
+  }
+
+  test("Support contains string expression with Collation") {
+    // Test 'contains' with different collations
+    var listLeft: List[String] = List()
+    var listRight: List[String] = List()
+    var listResult: List[Boolean] = List()
+
+    // UCS_BASIC (default) & UNICODE collation
+    listLeft = List("", "c", "abc", "cde", "abde", "abcde", "C", "ABC", "CDE", "ABDE", "ABCDE")
+    listRight = List("", "c", "abc", "cde", "abde", "abcde", "C", "ABC", "CDE", "ABDE", "ABCDE")
+    listResult = List(
+    //  ""     c     abc    cde   abde  abcde    C     ABC    CDE    ABDE  ABCDE
+      true, false, false, false, false, false, false, false, false, false, false, //  ""
+      true, true, false, false, false, false, false, false, false, false, false,  //   c
+      true, true, true, false, false, false, false, false, false, false, false,   // abc
+      true, true, false, true, false, false, false, false, false, false, false,   //   cde
+      true, false, false, false, true, false, false, false, false, false, false,  // abde
+      true, true, true, true, false, true, false, false, false, false, false,     // abcde
+      true, false, false, false, false, false, true, false, false, false, false,  //   C
+      true, false, false, false, false, false, true, true, false, false, false,   // ABC
+      true, false, false, false, false, false, true, false, true, false, false,   //   CDE
+      true, false, false, false, false, false, false, false, false, true, false,  // ABDE
+      true, false, false, false, false, false, true, true, true, false, true)     // ABCDE

Review Comment:
   While it may seem a bit unusual at first, I think this matrix approach covers a broad spectrum of test cases and generally works really well for this set of functions - it covers various edge cases and different collation types (this was especially useful when debugging and experimenting with new collations), e.g. imagine throwing Serbian (ć, Ć) or German (ä, Ä) collations into the mix alongside the other possible `abc`s.
   
   When I first wrote this as a standard linear set of tests, it was much harder to see how and why these functions behave the way they do under different collations, whereas a closer look at this nicely aligned matrix gives a pretty clear overview, all in one place (in addition, it was previously easy to miss a case, and hard-coding the expected results was extra tedious).
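   
   For readers following along, the per-collation loop that consumes these matrices is small enough to factor into a helper, so the matrices stay the only thing that varies between collations; a sketch assuming the suite's `sql`, `checkAnswer` and `Row` helpers:
   
   ```scala
   // Sketch: walk the full left x right matrix once for a given collation.
   val inputs = List("", "c", "abc", "cde", "abde", "abcde", "C", "ABC", "CDE", "ABDE", "ABCDE")

   def checkContainsMatrix(collation: String, expected: List[Boolean]): Unit = {
     for {
       (left, i) <- inputs.zipWithIndex
       (right, j) <- inputs.zipWithIndex
     } {
       val expectedAnswer = expected(i * inputs.length + j)
       checkAnswer(
         sql(s"SELECT contains(collate('$left', '$collation'), collate('$right', '$collation'))"),
         Row(expectedAnswer))
     }
   }
   ```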



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47131][SQL][COLLATION] String function support: contains, startswith, endswith [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on code in PR #45216:
URL: https://github.com/apache/spark/pull/45216#discussion_r1506046182


##########
common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java:
##########
@@ -341,6 +342,21 @@ public boolean contains(final UTF8String substring) {
     return false;
   }
 
+  public boolean contains(final UTF8String substring, int collationId) throws SparkException {
+    if (CollationFactory.fetchCollation(collationId).isBinaryCollation) {
+      return this.contains(substring);
+    }
+    if (collationId == CollationFactory.LOWERCASE_COLLATION_ID) {
+      return this.toLowerCase().contains(substring.toLowerCase());
+    }
+    // TODO: enable ICU collation support for "contains"
+    Map<String, String> params = new HashMap<>();
+    params.put("functionName", Thread.currentThread().getStackTrace()[1].getMethodName());

Review Comment:
   We should use the SQL function name, which is `contains`. It doesn't matter to end users what the internal Java function name is.
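   
   In practice that just means building the error parameters with the literal name; a minimal Scala sketch of the idea (the real call site is Java in UTF8String, and the `(errorClass, messageParameters, cause)` constructor shape is assumed from the call shown in the diff):
   
   ```scala
   import org.apache.spark.SparkException
   import org.apache.spark.sql.catalyst.util.CollationFactory

   object CollationErrors {
     // Sketch: spell out the SQL-facing function name instead of reflecting on the call stack.
     def collationNotSupported(collationId: Int): Nothing =
       throw new SparkException(
         errorClass = "COLLATION_NOT_SUPPORTED_FOR_FUNCTION",
         messageParameters = Map(
           "functionName" -> "contains", // the name users see in SQL
           "collationName" -> CollationFactory.fetchCollation(collationId).collationName),
         cause = null)
   }
   ```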



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47131][SQL][COLLATION] String function support: contains, startswith, endswith [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on code in PR #45216:
URL: https://github.com/apache/spark/pull/45216#discussion_r1506048114


##########
common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java:
##########
@@ -355,14 +371,43 @@ public boolean matchAt(final UTF8String s, int pos) {
     return ByteArrayMethods.arrayEquals(base, offset + pos, s.base, s.offset, s.numBytes);
   }
 
+  private boolean matchAt(final UTF8String s, int pos, int collationId) {
+    if (s.numBytes + pos > numBytes || pos < 0) {
+      return false;
+    }
+    return this.substring(pos, pos + s.numBytes).semanticCompare(s, collationId) == 0;
+  }
+
   public boolean startsWith(final UTF8String prefix) {
     return matchAt(prefix, 0);
   }
 
+  public boolean startsWith(final UTF8String prefix, int collationId) {
+    if (CollationFactory.fetchCollation(collationId).isBinaryCollation) {
+      return this.startsWith(prefix);
+    }
+    if (collationId == CollationFactory.LOWERCASE_COLLATION_ID) {
+      return this.toLowerCase().startsWith(prefix.toLowerCase());
+    }
+    // ICU collation support

Review Comment:
   is this a TODO?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47131][SQL][COLLATION] String function support: contains, startswith, endswith [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on code in PR #45216:
URL: https://github.com/apache/spark/pull/45216#discussion_r1505447421


##########
common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java:
##########
@@ -327,11 +329,10 @@ public UTF8String substringSQL(int pos, int length) {
   /**
    * Returns whether this contains `substring` or not.
    */
-  public boolean contains(final UTF8String substring) {
+  public boolean containsBinary(final UTF8String substring) {

Review Comment:
   Similar to `equals`/`compareTo`, let's not rename existing functions, for backward compatibility; just add a comment saying this is the binary version.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47131][SQL][COLLATION] String function support: contains, startswith, endswith [spark]

Posted by "uros-db (via GitHub)" <gi...@apache.org>.
uros-db commented on code in PR #45216:
URL: https://github.com/apache/spark/pull/45216#discussion_r1503794927


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala:
##########
@@ -559,6 +557,19 @@ case class BinaryPredicate(override val prettyName: String, left: Expression, ri
   }
 }
 
+trait CollationCheck {
+  self: BinaryExpression =>
+  def checkCollationIds(): Int = {
+    val leftCollationId: Int = left.dataType.asInstanceOf[StringType].collationId
+    val rightCollationId: Int = right.dataType.asInstanceOf[StringType].collationId
+    if (leftCollationId != rightCollationId) {
+      throw new IllegalArgumentException(
+        "Function requires the same collation type for left and right strings.")

Review Comment:
   Should be taken care of with new error classes (latest commit)



##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala:
##########
@@ -559,6 +557,19 @@ case class BinaryPredicate(override val prettyName: String, left: Expression, ri
   }
 }
 
+trait CollationCheck {
+  self: BinaryExpression =>

Review Comment:
   Done!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47131][SQL][COLLATION] String function support: contains, startswith, endswith [spark]

Posted by "uros-db (via GitHub)" <gi...@apache.org>.
uros-db commented on code in PR #45216:
URL: https://github.com/apache/spark/pull/45216#discussion_r1505450164


##########
common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java:
##########
@@ -341,6 +342,21 @@ public boolean contains(final UTF8String substring) {
     return false;
   }
 
+  public boolean contains(final UTF8String substring, int collationId) throws SparkException {
+    if (CollationFactory.fetchCollation(collationId).isBinaryCollation) {
+      return this.containsBinary(substring);
+    }
+    if (collationId == CollationFactory.LOWERCASE_COLLATION_ID) {
+      return this.toLowerCase().containsBinary(substring.toLowerCase());
+    }
+    // TODO: enable ICU collation support for "contains"
+    Map<String, String> params = new HashMap<>();
+    params.put("functionName", Thread.currentThread().getStackTrace()[1].getMethodName());
+    params.put("collationName", CollationFactory.fetchCollation(collationId).collationName);
+    throw new SparkException("COLLATION_NOT_SUPPORTED_FOR_FUNCTION",

Review Comment:
   Yes, we will implement "contains" for non-binary, non-lowercase collations in a future PR.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47131][SQL][COLLATION] String function support: contains, startswith, endswith [spark]

Posted by "uros-db (via GitHub)" <gi...@apache.org>.
uros-db commented on code in PR #45216:
URL: https://github.com/apache/spark/pull/45216#discussion_r1510763028


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala:
##########
@@ -497,10 +497,32 @@ case class Lower(child: Expression)
 abstract class StringPredicate extends BinaryExpression
   with Predicate with ImplicitCastInputTypes with NullIntolerant {
 
+  final val collationId: Int = left.dataType.asInstanceOf[StringType].collationId

Review Comment:
   makes sense! fixed
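   
   The exact fix is not shown in the thread, but the usual catalyst pattern is to defer reading the child's data type until the expression is resolved; a guess at the shape, inside `StringPredicate`:
   
   ```scala
   // Evaluate the cast lazily, once the children's data types are resolved,
   // rather than eagerly at construction time.
   final lazy val collationId: Int =
     left.dataType.asInstanceOf[StringType].collationId
   ```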



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47131][SQL][COLLATION] String function support: contains, startswith, endswith [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on PR #45216:
URL: https://github.com/apache/spark/pull/45216#issuecomment-1976739454

   thanks, merging to master!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47131][SQL][COLLATION] String function support: contains, startswith, endswith [spark]

Posted by "uros-db (via GitHub)" <gi...@apache.org>.
uros-db commented on code in PR #45216:
URL: https://github.com/apache/spark/pull/45216#discussion_r1502795006


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala:
##########
@@ -622,10 +646,24 @@ object StartsWithExpressionBuilder extends StringBinaryPredicateExpressionBuilde
   }
 }
 
-case class StartsWith(left: Expression, right: Expression) extends StringPredicate {
-  override def compare(l: UTF8String, r: UTF8String): Boolean = l.startsWith(r)
+case class StartsWith(left: Expression, right: Expression) extends StringPredicate
+  with CollationCheck {
+  override def compare(l: UTF8String, r: UTF8String): Boolean = {
+    val collationId = checkCollationIds()
+    if (CollationFactory.fetchCollation(collationId).isBinaryCollation) {
+      l.startsWith(r)

Review Comment:
   You're right. "startsWith" was the original name for the binary method. Now, it would probably be best to change it to "startsWithBinary" (and probably "startsWithNonBinary" for the other one)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47131][SQL][COLLATION] String function support: contains, startswith, endswith [spark]

Posted by "dbatomic (via GitHub)" <gi...@apache.org>.
dbatomic commented on code in PR #45216:
URL: https://github.com/apache/spark/pull/45216#discussion_r1502776606


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala:
##########
@@ -559,6 +557,19 @@ case class BinaryPredicate(override val prettyName: String, left: Expression, ri
   }
 }
 
+trait CollationCheck {
+  self: BinaryExpression =>
+  def checkCollationIds(): Int = {
+    val leftCollationId: Int = left.dataType.asInstanceOf[StringType].collationId
+    val rightCollationId: Int = right.dataType.asInstanceOf[StringType].collationId
+    if (leftCollationId != rightCollationId) {
+      throw new IllegalArgumentException(
+        "Function requires the same collation type for left and right strings.")

Review Comment:
   Add a bit more information here, e.g. left = ..., right = ....
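   
   A sketch of what that richer message could look like, building on the `CollationCheck` trait from the diff (the PR later switched to a proper error class, so this is only illustrative):
   
   ```scala
   import org.apache.spark.sql.catalyst.expressions.BinaryExpression
   import org.apache.spark.sql.catalyst.util.CollationFactory
   import org.apache.spark.sql.types.StringType

   trait CollationCheck { self: BinaryExpression =>
     def checkCollationIds(): Int = {
       val leftCollationId = left.dataType.asInstanceOf[StringType].collationId
       val rightCollationId = right.dataType.asInstanceOf[StringType].collationId
       if (leftCollationId != rightCollationId) {
         throw new IllegalArgumentException(
           "Function requires the same collation for both arguments, but got left = " +
             CollationFactory.fetchCollation(leftCollationId).collationName + ", right = " +
             CollationFactory.fetchCollation(rightCollationId).collationName + ".")
       }
       leftCollationId
     }
   }
   ```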



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org