You are viewing a plain text version of this content. The canonical link for it is here.

Posted to reviews@spark.apache.org by "miland-db (via GitHub)" <gi...@apache.org> on 2024/03/26 15:09:48 UTC

[PR] [SPARK-47566][SQL] Support SUBSTRING_INDEX function to work with collated strings [spark]

miland-db opened a new pull request, #45725:
URL: https://github.com/apache/spark/pull/45725

   ### What changes were proposed in this pull request?
   Extend built-in string functions to support non-binary, non-lowercase collation for: substring_index.
   
   ### Why are the changes needed?
   Update collation support for built-in string functions in Spark.
   
   ### Does this PR introduce _any_ user-facing change?
   Yes, users should now be able to use COLLATE within arguments for built-in string function SUBSTRING_INDEX in Spark SQL queries, using non-binary collations such as UNICODE_CI.
   
   ### How was this patch tested?
   Unit tests for queries using StringReplace (`CollationStringExpressionsSuite.scala`).
   
   ### Was this patch authored or co-authored using generative AI tooling?
   No
   
   ### Info:
   There is no check for collation match between string and delimiter, it will be introduced with Implicit Casting.
   
   We can remove the original `public UTF8String subStringIndex(UTF8String delim, int count)` method, and get the existing behavior using `subStringIndex(delim, count, 0)`.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47566][SQL] Support SUBSTRING_INDEX function to work with collated strings [spark]

Posted by "stefankandic (via GitHub)" <gi...@apache.org>.

stefankandic commented on code in PR #45725:
URL: https://github.com/apache/spark/pull/45725#discussion_r1544722450


##########
common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java:
##########
@@ -913,6 +1028,57 @@ public UTF8String subStringIndex(UTF8String delim, int count) {
     }
   }
 
+  public UTF8String subStringIndex(UTF8String delim, int count, int collationId) {
+    if (CollationFactory.fetchCollation(collationId).supportsBinaryEquality) {

Review Comment:
   aren't we already checking for this when we do codegen? This way we're losing the benefit of not having to worry about branching for the codegen path



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47566][SQL] Support SUBSTRING_INDEX function to work with collated strings [spark]

Posted by "miland-db (via GitHub)" <gi...@apache.org>.

miland-db commented on PR #45725:
URL: https://github.com/apache/spark/pull/45725#issuecomment-2033959175

   Adding: @cloud-fan please review this change


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47566][SQL] Support SUBSTRING_INDEX function to work with collated strings [spark]

Posted by "miland-db (via GitHub)" <gi...@apache.org>.

miland-db commented on code in PR #45725:
URL: https://github.com/apache/spark/pull/45725#discussion_r1546055279


##########
common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java:
##########
@@ -913,6 +1028,57 @@ public UTF8String subStringIndex(UTF8String delim, int count) {
     }
   }
 
+  public UTF8String subStringIndex(UTF8String delim, int count, int collationId) {
+    if (CollationFactory.fetchCollation(collationId).supportsBinaryEquality) {

Review Comment:
   Sure, for this function we can do that.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47566][SQL] Support SUBSTRING_INDEX function to work with collated strings [spark]

Posted by "stefankandic (via GitHub)" <gi...@apache.org>.

stefankandic commented on code in PR #45725:
URL: https://github.com/apache/spark/pull/45725#discussion_r1544714282


##########
common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java:
##########
@@ -863,6 +914,70 @@ private int rfind(UTF8String str, int start) {
     return -1;
   }
 
+  /**
+   * Find the `str` from right to left considering different collations.
+   */
+  private int rfind(UTF8String str, int start, int collationId) {

Review Comment:
   did you think about maybe not overloading these functions but instead having a default value or optional parameter for collationId?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47566][SQL] Support SUBSTRING_INDEX function to work with collated strings [spark]

Posted by "stefankandic (via GitHub)" <gi...@apache.org>.

stefankandic commented on code in PR #45725:
URL: https://github.com/apache/spark/pull/45725#discussion_r1544730418


##########
sql/core/src/test/scala/org/apache/spark/sql/CollationStringExpressionsSuite.scala:
##########
@@ -73,6 +73,84 @@ class CollationStringExpressionsSuite extends QueryTest
     })
   }
 
+  test("SUBSTRING_INDEX check result on explicitly collated strings") {
+    def testSubstringIndex(str: String, delim: String, cnt: Integer,
+                           collationId: Integer, expected: String): Unit = {
+      val string = Literal.create(str, StringType(collationId))
+      val delimiter = Literal.create(delim, StringType(collationId))
+      val count = Literal(cnt)
+
+      checkEvaluation(SubstringIndex(string, delimiter, count), expected)
+    }
+
+    testSubstringIndex("wwwgapachegorg", "g", -3, 0, "apachegorg")
+    testSubstringIndex("www||apache||org", "||", 2, 0, "www||apache")
+    // UTF8_BINARY_LCASE

Review Comment:
   instead of separating these blocks with comments describing the collation used how about using a variable denoting the current collation
   ```
   var collationId = CollationFactory.nameToId("UTF8_BINARY")
   testcases ...
   
   collationId = CollationFactory.nameToId("UNICODE")
   testcases ...
   
   ```
   
   or even we can do it separately for each
   ```
   val unicodeId = CollationFacotry.nameToId(...) 
   ```
   
   this way we won't have to litter the code with [magic constants](https://en.wikipedia.org/wiki/Magic_number_(programming)) which should make it more readable



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47566][SQL] Support SUBSTRING_INDEX function to work with collated strings [spark]

Posted by "miland-db (via GitHub)" <gi...@apache.org>.

miland-db commented on PR #45725:
URL: https://github.com/apache/spark/pull/45725#issuecomment-2020985292

   @uros-db @mihailom-db @MaxGekk please take a look at this changes


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47566][SQL] Support SubstringIndex function to work with collated strings [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.

cloud-fan closed pull request #45725: [SPARK-47566][SQL] Support SubstringIndex function to work with collated strings
URL: https://github.com/apache/spark/pull/45725


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47566][SQL] Support SUBSTRING_INDEX function to work with collated strings [spark]

Posted by "stefankandic (via GitHub)" <gi...@apache.org>.

stefankandic commented on code in PR #45725:
URL: https://github.com/apache/spark/pull/45725#discussion_r1544723643


##########
common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java:
##########
@@ -913,6 +1028,57 @@ public UTF8String subStringIndex(UTF8String delim, int count) {
     }
   }
 
+  public UTF8String subStringIndex(UTF8String delim, int count, int collationId) {
+    if (CollationFactory.fetchCollation(collationId).supportsBinaryEquality) {

Review Comment:
   at least we can directly call `collatedSubStringIndex` in codegen



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47566][SQL] Support SUBSTRING_INDEX function to work with collated strings [spark]

Posted by "stefankandic (via GitHub)" <gi...@apache.org>.

stefankandic commented on code in PR #45725:
URL: https://github.com/apache/spark/pull/45725#discussion_r1544730418


##########
sql/core/src/test/scala/org/apache/spark/sql/CollationStringExpressionsSuite.scala:
##########
@@ -73,6 +73,84 @@ class CollationStringExpressionsSuite extends QueryTest
     })
   }
 
+  test("SUBSTRING_INDEX check result on explicitly collated strings") {
+    def testSubstringIndex(str: String, delim: String, cnt: Integer,
+                           collationId: Integer, expected: String): Unit = {
+      val string = Literal.create(str, StringType(collationId))
+      val delimiter = Literal.create(delim, StringType(collationId))
+      val count = Literal(cnt)
+
+      checkEvaluation(SubstringIndex(string, delimiter, count), expected)
+    }
+
+    testSubstringIndex("wwwgapachegorg", "g", -3, 0, "apachegorg")
+    testSubstringIndex("www||apache||org", "||", 2, 0, "www||apache")
+    // UTF8_BINARY_LCASE

Review Comment:
   instead of separating these blocks with comments describing the collation used how about using a variable denoting the current collation
   ```
   var collationId = CollationFactory.nameToId("UTF8_BINARY")
   testcases ...
   
   collationId = CollationFactory.nameToId("UNICODE")
   testcases ...
   
   ```
   
   this way we won't have to litter the code with [magic constants](https://en.wikipedia.org/wiki/Magic_number_(programming)) which should make it more readable



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47566][SQL] Support SubstringIndex function to work with collated strings [spark]

Posted by "miland-db (via GitHub)" <gi...@apache.org>.

miland-db commented on code in PR #45725:
URL: https://github.com/apache/spark/pull/45725#discussion_r1584430498


##########
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationSupport.java:
##########
@@ -441,6 +444,45 @@ public static int execICU(final UTF8String string, final UTF8String substring, f
     }
   }
 
+  public static class SubstringIndex {
+    public static UTF8String exec(final UTF8String string, final UTF8String delimiter,
+        final int count, final int collationId) {
+      CollationFactory.Collation collation = CollationFactory.fetchCollation(collationId);
+      if (collation.supportsBinaryEquality) {
+        return execBinary(string, delimiter, count);
+      } else if (collation.supportsLowercaseEquality) {
+        return execLowercase(string, delimiter, count);
+      } else {
+        return execICU(string, delimiter, count, collationId);
+      }
+    }
+    public static String genCode(final String string, final String delimiter,
+        final int count, final int collationId) {
+      CollationFactory.Collation collation = CollationFactory.fetchCollation(collationId);
+      String expr = "CollationSupport.SubstringIndex.exec";
+      if (collation.supportsBinaryEquality) {
+        return String.format(expr + "Binary(%s, %s, %d)", string, delimiter, count);
+      } else if (collation.supportsLowercaseEquality) {
+        return String.format(expr + "Lowercase(%s, %s, %d)", string, delimiter, count);
+      } else {
+        return String.format(expr + "ICU(%s, %s, %d, %d)", string, delimiter, count, collationId);
+      }
+    }
+    public static UTF8String execBinary(final UTF8String string, final UTF8String delimiter,
+        final int count) {
+      return string.subStringIndex(delimiter, count);
+    }
+    public static UTF8String execLowercase(final UTF8String string, final UTF8String delimiter,
+        final int count) {
+      return CollationAwareUTF8String.lowercaseSubStringIndex(string, delimiter, count);

Review Comment:
   Maybe in the next PR. We will consider this option



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47566][SQL] Support SUBSTRING_INDEX function to work with collated strings [spark]

Posted by "miland-db (via GitHub)" <gi...@apache.org>.

miland-db commented on PR #45725:
URL: https://github.com/apache/spark/pull/45725#issuecomment-2020856340

   - Using this function without explicit collations should have the same performance as before. 
   - For UTF8_BINARY_LCASE collation, performance should have the same _asymptotic time complexity_ but it will take a bit more time for converting strings to lowercase.
   - Performance for other non-binary collations depends on StringSearch implementation, but it is widely used to do string search on collated strings.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47566][SQL] Support SUBSTRING_INDEX function to work with collated strings [spark]

Posted by "stefankandic (via GitHub)" <gi...@apache.org>.

stefankandic commented on code in PR #45725:
URL: https://github.com/apache/spark/pull/45725#discussion_r1544732383


##########
sql/core/src/test/scala/org/apache/spark/sql/CollationStringExpressionsSuite.scala:
##########
@@ -73,6 +73,84 @@ class CollationStringExpressionsSuite extends QueryTest
     })
   }
 
+  test("SUBSTRING_INDEX check result on explicitly collated strings") {
+    def testSubstringIndex(str: String, delim: String, cnt: Integer,
+                           collationId: Integer, expected: String): Unit = {
+      val string = Literal.create(str, StringType(collationId))
+      val delimiter = Literal.create(delim, StringType(collationId))
+      val count = Literal(cnt)
+
+      checkEvaluation(SubstringIndex(string, delimiter, count), expected)
+    }
+
+    testSubstringIndex("wwwgapachegorg", "g", -3, 0, "apachegorg")
+    testSubstringIndex("www||apache||org", "||", 2, 0, "www||apache")
+    // UTF8_BINARY_LCASE
+    testSubstringIndex("AaAaAaAaAa", "aa", 2, 1, "A")
+    testSubstringIndex("www.apache.org", ".", 3, 1, "www.apache.org")
+    testSubstringIndex("wwwXapachexorg", "x", 2, 1, "wwwXapache")
+    testSubstringIndex("wwwxapacheXorg", "X", 1, 1, "www")
+    testSubstringIndex("www.apache.org", ".", 0, 1, "")
+    testSubstringIndex("www.apache.ORG", ".", -3, 1, "www.apache.ORG")
+    testSubstringIndex("wwwGapacheGorg", "g", 1, 1, "www")
+    testSubstringIndex("wwwGapacheGorg", "g", 3, 1, "wwwGapacheGor")
+    testSubstringIndex("gwwwGapacheGorg", "g", 3, 1, "gwwwGapache")
+    testSubstringIndex("wwwGapacheGorg", "g", -3, 1, "apacheGorg")
+    testSubstringIndex("wwwmapacheMorg", "M", -2, 1, "apacheMorg")
+    testSubstringIndex("www.apache.org", ".", -1, 1, "org")
+    testSubstringIndex("", ".", -2, 1, "")
+    // scalastyle:off
+    testSubstringIndex("test大千世界X大千世界", "x", -1, 1, "大千世界")
+    testSubstringIndex("test大千世界X大千世界", "X", 1, 1, "test大千世界")
+    testSubstringIndex("test大千世界大千世界", "千", 2, 1, "test大千世界大")
+    // scalastyle:on
+    testSubstringIndex("www||APACHE||org", "||", 2, 1, "www||APACHE")
+    testSubstringIndex("www||APACHE||org", "||", -1, 1, "org")
+    // UNICODE
+    testSubstringIndex("AaAaAaAaAa", "Aa", 2, 2, "Aa")
+    testSubstringIndex("wwwYapacheyorg", "y", 3, 2, "wwwYapacheyorg")
+    testSubstringIndex("www.apache.org", ".", 2, 2, "www.apache")

Review Comment:
   accidentally left this under UNICODE tests - but this only applies for UNICODE_CI and UTF8_BINARY_LCASE



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47566][SQL] Support SUBSTRING_INDEX function to work with collated strings [spark]

Posted by "miland-db (via GitHub)" <gi...@apache.org>.

miland-db commented on code in PR #45725:
URL: https://github.com/apache/spark/pull/45725#discussion_r1546053240


##########
common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java:
##########
@@ -863,6 +914,70 @@ private int rfind(UTF8String str, int start) {
     return -1;
   }
 
+  /**
+   * Find the `str` from right to left considering different collations.
+   */
+  private int rfind(UTF8String str, int start, int collationId) {
+    if (CollationFactory.fetchCollation(collationId).supportsBinaryEquality) {
+      return this.rfind(str, start);
+    }
+    if(collationId == CollationFactory.UTF8_BINARY_LCASE_COLLATION_ID) {
+      return rfindLowercase(str, start);
+    }
+    return collatedRFind(str, start, collationId);
+  }
+
+  /**
+   * Find the `str` from left to right considering binary lowercase collation.
+   */
+  private int rfindLowercase(UTF8String str, int start) {

Review Comment:
   Sure, but I think that we still don't have real case insensitive support. When we implement this kind of collator, we can change this code later.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47566][SQL] Support SUBSTRING_INDEX function to work with collated strings [spark]

Posted by "stefankandic (via GitHub)" <gi...@apache.org>.

stefankandic commented on code in PR #45725:
URL: https://github.com/apache/spark/pull/45725#discussion_r1544714282


##########
common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java:
##########
@@ -863,6 +914,70 @@ private int rfind(UTF8String str, int start) {
     return -1;
   }
 
+  /**
+   * Find the `str` from right to left considering different collations.
+   */
+  private int rfind(UTF8String str, int start, int collationId) {

Review Comment:
   did you think about maybe not overloading these functions but instead having a default value or optional parameter for collationId?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47566][SQL] Support SUBSTRING_INDEX function to work with collated strings [spark]

Posted by "stefankandic (via GitHub)" <gi...@apache.org>.

stefankandic commented on code in PR #45725:
URL: https://github.com/apache/spark/pull/45725#discussion_r1544719812


##########
common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java:
##########
@@ -863,6 +914,70 @@ private int rfind(UTF8String str, int start) {
     return -1;
   }
 
+  /**
+   * Find the `str` from right to left considering different collations.
+   */
+  private int rfind(UTF8String str, int start, int collationId) {
+    if (CollationFactory.fetchCollation(collationId).supportsBinaryEquality) {
+      return this.rfind(str, start);
+    }
+    if(collationId == CollationFactory.UTF8_BINARY_LCASE_COLLATION_ID) {
+      return rfindLowercase(str, start);
+    }
+    return collatedRFind(str, start, collationId);
+  }
+
+  /**
+   * Find the `str` from left to right considering binary lowercase collation.
+   */
+  private int rfindLowercase(UTF8String str, int start) {
+    if(numBytes == 0 || str.numBytes == 0) {
+      return -1;
+    }
+
+    UTF8String lowercaseThis = this.toLowerCase();
+    UTF8String lowercaseStr = str.toLowerCase();
+
+    int prevStart = -1;
+    int matchStart = lowercaseThis.indexOf(lowercaseStr, 0);
+    while(charPosToByte(matchStart) <= start) {
+      if(matchStart != -1) {
+        // Found a match, update the start position
+        prevStart = matchStart;
+        matchStart = lowercaseThis.indexOf(lowercaseStr, matchStart + 1);
+      } else {
+        return charPosToByte(prevStart);
+      }
+    }
+
+    return charPosToByte(prevStart);
+  }
+
+  /**
+   * Find the `str` from left to right considering non-binary collations.
+   */
+  private int collatedRFind(UTF8String str, int start, int collationId) {
+    if(numBytes == 0 || str.numBytes == 0) {

Review Comment:
   same questions as above for `str.numBytes == 0`



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47566][SQL] Support SUBSTRING_INDEX function to work with collated strings [spark]

Posted by "stefankandic (via GitHub)" <gi...@apache.org>.

stefankandic commented on code in PR #45725:
URL: https://github.com/apache/spark/pull/45725#discussion_r1544731317


##########
sql/core/src/test/scala/org/apache/spark/sql/CollationStringExpressionsSuite.scala:
##########
@@ -73,6 +73,84 @@ class CollationStringExpressionsSuite extends QueryTest
     })
   }
 
+  test("SUBSTRING_INDEX check result on explicitly collated strings") {
+    def testSubstringIndex(str: String, delim: String, cnt: Integer,
+                           collationId: Integer, expected: String): Unit = {
+      val string = Literal.create(str, StringType(collationId))
+      val delimiter = Literal.create(delim, StringType(collationId))
+      val count = Literal(cnt)
+
+      checkEvaluation(SubstringIndex(string, delimiter, count), expected)
+    }
+
+    testSubstringIndex("wwwgapachegorg", "g", -3, 0, "apachegorg")
+    testSubstringIndex("www||apache||org", "||", 2, 0, "www||apache")
+    // UTF8_BINARY_LCASE
+    testSubstringIndex("AaAaAaAaAa", "aa", 2, 1, "A")
+    testSubstringIndex("www.apache.org", ".", 3, 1, "www.apache.org")
+    testSubstringIndex("wwwXapachexorg", "x", 2, 1, "wwwXapache")
+    testSubstringIndex("wwwxapacheXorg", "X", 1, 1, "www")
+    testSubstringIndex("www.apache.org", ".", 0, 1, "")
+    testSubstringIndex("www.apache.ORG", ".", -3, 1, "www.apache.ORG")
+    testSubstringIndex("wwwGapacheGorg", "g", 1, 1, "www")
+    testSubstringIndex("wwwGapacheGorg", "g", 3, 1, "wwwGapacheGor")
+    testSubstringIndex("gwwwGapacheGorg", "g", 3, 1, "gwwwGapache")
+    testSubstringIndex("wwwGapacheGorg", "g", -3, 1, "apacheGorg")
+    testSubstringIndex("wwwmapacheMorg", "M", -2, 1, "apacheMorg")
+    testSubstringIndex("www.apache.org", ".", -1, 1, "org")
+    testSubstringIndex("", ".", -2, 1, "")
+    // scalastyle:off
+    testSubstringIndex("test大千世界X大千世界", "x", -1, 1, "大千世界")
+    testSubstringIndex("test大千世界X大千世界", "X", 1, 1, "test大千世界")
+    testSubstringIndex("test大千世界大千世界", "千", 2, 1, "test大千世界大")
+    // scalastyle:on
+    testSubstringIndex("www||APACHE||org", "||", 2, 1, "www||APACHE")
+    testSubstringIndex("www||APACHE||org", "||", -1, 1, "org")
+    // UNICODE
+    testSubstringIndex("AaAaAaAaAa", "Aa", 2, 2, "Aa")
+    testSubstringIndex("wwwYapacheyorg", "y", 3, 2, "wwwYapacheyorg")
+    testSubstringIndex("www.apache.org", ".", 2, 2, "www.apache")

Review Comment:
   almost all of these have exact case match for both string - maybe randomize it a bit for case insensitive collations?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47566][SQL] Support SubstringIndex function to work with collated strings [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.

cloud-fan commented on code in PR #45725:
URL: https://github.com/apache/spark/pull/45725#discussion_r1584428501


##########
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationSupport.java:
##########
@@ -441,6 +444,45 @@ public static int execICU(final UTF8String string, final UTF8String substring, f
     }
   }
 
+  public static class SubstringIndex {
+    public static UTF8String exec(final UTF8String string, final UTF8String delimiter,
+        final int count, final int collationId) {
+      CollationFactory.Collation collation = CollationFactory.fetchCollation(collationId);
+      if (collation.supportsBinaryEquality) {
+        return execBinary(string, delimiter, count);
+      } else if (collation.supportsLowercaseEquality) {
+        return execLowercase(string, delimiter, count);
+      } else {
+        return execICU(string, delimiter, count, collationId);
+      }
+    }
+    public static String genCode(final String string, final String delimiter,
+        final int count, final int collationId) {
+      CollationFactory.Collation collation = CollationFactory.fetchCollation(collationId);
+      String expr = "CollationSupport.SubstringIndex.exec";
+      if (collation.supportsBinaryEquality) {
+        return String.format(expr + "Binary(%s, %s, %d)", string, delimiter, count);
+      } else if (collation.supportsLowercaseEquality) {
+        return String.format(expr + "Lowercase(%s, %s, %d)", string, delimiter, count);
+      } else {
+        return String.format(expr + "ICU(%s, %s, %d, %d)", string, delimiter, count, collationId);
+      }
+    }
+    public static UTF8String execBinary(final UTF8String string, final UTF8String delimiter,
+        final int count) {
+      return string.subStringIndex(delimiter, count);
+    }
+    public static UTF8String execLowercase(final UTF8String string, final UTF8String delimiter,
+        final int count) {
+      return CollationAwareUTF8String.lowercaseSubStringIndex(string, delimiter, count);

Review Comment:
   The class `CollationAwareUTF8String` is getting bigger. Shall we move it to an individual file?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47566][SQL] Support SUBSTRING_INDEX function to work with collated strings [spark]

Posted by "miland-db (via GitHub)" <gi...@apache.org>.

miland-db commented on PR #45725:
URL: https://github.com/apache/spark/pull/45725#issuecomment-2022319005

   I am testing functions from `stringExpressions`. Existing tests for these functions on non-collated strings are written in `StringExpressionsSuite`. Following that logic test on collated strings using functions from `stringExpressions` should be in `CollationStringExpressionsSuite`. I can add unit tests for changes introduced in `UTF8String` to `UTF8StringWithCollationSuite.java` if that's what we want to test that part more thoroughly.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47566][SQL] Support SUBSTRING_INDEX function to work with collated strings [spark]

Posted by "stefankandic (via GitHub)" <gi...@apache.org>.

stefankandic commented on code in PR #45725:
URL: https://github.com/apache/spark/pull/45725#discussion_r1544716041


##########
common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java:
##########
@@ -863,6 +914,70 @@ private int rfind(UTF8String str, int start) {
     return -1;
   }
 
+  /**
+   * Find the `str` from right to left considering different collations.
+   */
+  private int rfind(UTF8String str, int start, int collationId) {
+    if (CollationFactory.fetchCollation(collationId).supportsBinaryEquality) {
+      return this.rfind(str, start);
+    }
+    if(collationId == CollationFactory.UTF8_BINARY_LCASE_COLLATION_ID) {
+      return rfindLowercase(str, start);
+    }
+    return collatedRFind(str, start, collationId);
+  }
+
+  /**
+   * Find the `str` from left to right considering binary lowercase collation.
+   */
+  private int rfindLowercase(UTF8String str, int start) {
+    if(numBytes == 0 || str.numBytes == 0) {
+      return -1;
+    }
+
+    UTF8String lowercaseThis = this.toLowerCase();
+    UTF8String lowercaseStr = str.toLowerCase();
+
+    int prevStart = -1;
+    int matchStart = lowercaseThis.indexOf(lowercaseStr, 0);
+    while(charPosToByte(matchStart) <= start) {
+      if(matchStart != -1) {
+        // Found a match, update the start position
+        prevStart = matchStart;
+        matchStart = lowercaseThis.indexOf(lowercaseStr, matchStart + 1);
+      } else {
+        return charPosToByte(prevStart);
+      }
+    }
+
+    return charPosToByte(prevStart);
+  }
+
+  /**
+   * Find the `str` from left to right considering non-binary collations.
+   */
+  private int collatedRFind(UTF8String str, int start, int collationId) {

Review Comment:
   similar to the other comment regarding fn naming i think `collationAwareRFind` would be better



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47566][SQL] Support SUBSTRING_INDEX function to work with collated strings [spark]

Posted by "miland-db (via GitHub)" <gi...@apache.org>.

miland-db commented on PR #45725:
URL: https://github.com/apache/spark/pull/45725#issuecomment-2027484710

   @stefankandic can you also review this change please?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47566][SQL] Support SubstringIndex function to work with collated strings [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.

cloud-fan commented on PR #45725:
URL: https://github.com/apache/spark/pull/45725#issuecomment-2084801665

   thanks, merging to master!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47566][SQL] Support SUBSTRING_INDEX function to work with collated strings [spark]

Posted by "uros-db (via GitHub)" <gi...@apache.org>.

uros-db commented on code in PR #45725:
URL: https://github.com/apache/spark/pull/45725#discussion_r1549421652


##########
common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java:
##########
@@ -913,6 +998,50 @@ public UTF8String subStringIndex(UTF8String delim, int count) {
     }
   }
 
+  public UTF8String collationAwareSubStringIndex(UTF8String delim, int count, int collationId) {

Review Comment:
   I think we now have a lot of new methods 
   consider adding comments such as: https://github.com/apache/spark/pull/45791/files#r1549415554



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47566][SQL] Support SUBSTRING_INDEX function to work with collated strings [spark]

Posted by "stefankandic (via GitHub)" <gi...@apache.org>.

stefankandic commented on code in PR #45725:
URL: https://github.com/apache/spark/pull/45725#discussion_r1544713454


##########
common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java:
##########
@@ -863,6 +914,70 @@ private int rfind(UTF8String str, int start) {
     return -1;
   }
 
+  /**
+   * Find the `str` from right to left considering different collations.
+   */
+  private int rfind(UTF8String str, int start, int collationId) {
+    if (CollationFactory.fetchCollation(collationId).supportsBinaryEquality) {
+      return this.rfind(str, start);
+    }
+    if(collationId == CollationFactory.UTF8_BINARY_LCASE_COLLATION_ID) {
+      return rfindLowercase(str, start);
+    }
+    return collatedRFind(str, start, collationId);
+  }
+
+  /**
+   * Find the `str` from left to right considering binary lowercase collation.
+   */
+  private int rfindLowercase(UTF8String str, int start) {
+    if(numBytes == 0 || str.numBytes == 0) {

Review Comment:
   this seems to have a different behaviour compared to regular rfind which would fail if the number of bytes is zero, right?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47566][SQL] Support SUBSTRING_INDEX function to work with collated strings [spark]

Posted by "uros-db (via GitHub)" <gi...@apache.org>.

uros-db commented on PR #45725:
URL: https://github.com/apache/spark/pull/45725#issuecomment-2049812986

   heads up: we’ve done some major code restructuring in https://github.com/apache/spark/pull/45978, so please sync these changes before moving on
   
   @miland-db you’ll likely need to rewrite the code in this PR, so please follow the guidelines outlined in https://issues.apache.org/jira/browse/SPARK-47410


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47566][SQL] Support SUBSTRING_INDEX function to work with collated strings [spark]

Posted by "mihailom-db (via GitHub)" <gi...@apache.org>.

mihailom-db commented on code in PR #45725:
URL: https://github.com/apache/spark/pull/45725#discussion_r1565318876


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CollationTypeCasts.scala:
##########
@@ -47,7 +47,7 @@ object CollationTypeCasts extends TypeCoercionRule {
 
     case otherExpr @ (
       _: In | _: InSubquery | _: CreateArray | _: ArrayJoin | _: Concat | _: Greatest | _: Least |
-      _: Coalesce | _: BinaryExpression | _: ConcatWs) =>
+      _: Coalesce | _: BinaryExpression | _: ConcatWs | _: SubstringIndex) =>

Review Comment:
   Please special case it, not to take IntegerType into account.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47566][SQL] Support SUBSTRING_INDEX function to work with collated strings [spark]

Posted by "uros-db (via GitHub)" <gi...@apache.org>.

uros-db commented on code in PR #45725:
URL: https://github.com/apache/spark/pull/45725#discussion_r1546515387


##########
common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java:
##########
@@ -913,6 +1028,50 @@ public UTF8String subStringIndex(UTF8String delim, int count) {
     }
   }
 
+  public UTF8String collatedSubStringIndex(UTF8String delim, int count, int collationId) {

Review Comment:
   Let's go with this naming: `collationAwareSubStringIndex`
   
   ref: https://github.com/apache/spark/pull/45749#discussion_r1546421174



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47566][SQL] Support SUBSTRING_INDEX function to work with collated strings [spark]

Posted by "stefankandic (via GitHub)" <gi...@apache.org>.

stefankandic commented on code in PR #45725:
URL: https://github.com/apache/spark/pull/45725#discussion_r1544715571


##########
common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java:
##########
@@ -863,6 +914,70 @@ private int rfind(UTF8String str, int start) {
     return -1;
   }
 
+  /**
+   * Find the `str` from right to left considering different collations.
+   */
+  private int rfind(UTF8String str, int start, int collationId) {
+    if (CollationFactory.fetchCollation(collationId).supportsBinaryEquality) {
+      return this.rfind(str, start);
+    }
+    if(collationId == CollationFactory.UTF8_BINARY_LCASE_COLLATION_ID) {
+      return rfindLowercase(str, start);
+    }
+    return collatedRFind(str, start, collationId);
+  }
+
+  /**
+   * Find the `str` from left to right considering binary lowercase collation.
+   */
+  private int rfindLowercase(UTF8String str, int start) {

Review Comment:
   i know that the collation is called UTF8_BINARY_LCASE but I'm not the biggest fan of this name as it somehow implies that it will only search for lowercase values, how about `rfindCaseInsensitive` or `caseInsensitiveRFind` instead?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47566][SQL] Support SUBSTRING_INDEX function to work with collated strings [spark]

Posted by "stefankandic (via GitHub)" <gi...@apache.org>.

stefankandic commented on code in PR #45725:
URL: https://github.com/apache/spark/pull/45725#discussion_r1544730418


##########
sql/core/src/test/scala/org/apache/spark/sql/CollationStringExpressionsSuite.scala:
##########
@@ -73,6 +73,84 @@ class CollationStringExpressionsSuite extends QueryTest
     })
   }
 
+  test("SUBSTRING_INDEX check result on explicitly collated strings") {
+    def testSubstringIndex(str: String, delim: String, cnt: Integer,
+                           collationId: Integer, expected: String): Unit = {
+      val string = Literal.create(str, StringType(collationId))
+      val delimiter = Literal.create(delim, StringType(collationId))
+      val count = Literal(cnt)
+
+      checkEvaluation(SubstringIndex(string, delimiter, count), expected)
+    }
+
+    testSubstringIndex("wwwgapachegorg", "g", -3, 0, "apachegorg")
+    testSubstringIndex("www||apache||org", "||", 2, 0, "www||apache")
+    // UTF8_BINARY_LCASE

Review Comment:
   instead of separating these blocks with comments describing the collation used how about using a variable denoting the current collation
   ```
   var collationId = CollationFactory.nameToId("UTF8_BINARY")
   testcases ...
   
   collationId = CollationFactory.nameToId("UNICODE")
   testcases ...
   
   ```
   
   this way we won't have to litter the code with 'magic constants' which should make it more readable



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47566][SQL] Support SubstringIndex function to work with collated strings [spark]

Posted by "uros-db (via GitHub)" <gi...@apache.org>.

uros-db commented on code in PR #45725:
URL: https://github.com/apache/spark/pull/45725#discussion_r1584432496


##########
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationSupport.java:
##########
@@ -441,6 +444,45 @@ public static int execICU(final UTF8String string, final UTF8String substring, f
     }
   }
 
+  public static class SubstringIndex {
+    public static UTF8String exec(final UTF8String string, final UTF8String delimiter,
+        final int count, final int collationId) {
+      CollationFactory.Collation collation = CollationFactory.fetchCollation(collationId);
+      if (collation.supportsBinaryEquality) {
+        return execBinary(string, delimiter, count);
+      } else if (collation.supportsLowercaseEquality) {
+        return execLowercase(string, delimiter, count);
+      } else {
+        return execICU(string, delimiter, count, collationId);
+      }
+    }
+    public static String genCode(final String string, final String delimiter,
+        final int count, final int collationId) {
+      CollationFactory.Collation collation = CollationFactory.fetchCollation(collationId);
+      String expr = "CollationSupport.SubstringIndex.exec";
+      if (collation.supportsBinaryEquality) {
+        return String.format(expr + "Binary(%s, %s, %d)", string, delimiter, count);
+      } else if (collation.supportsLowercaseEquality) {
+        return String.format(expr + "Lowercase(%s, %s, %d)", string, delimiter, count);
+      } else {
+        return String.format(expr + "ICU(%s, %s, %d, %d)", string, delimiter, count, collationId);
+      }
+    }
+    public static UTF8String execBinary(final UTF8String string, final UTF8String delimiter,
+        final int count) {
+      return string.subStringIndex(delimiter, count);
+    }
+    public static UTF8String execLowercase(final UTF8String string, final UTF8String delimiter,
+        final int count) {
+      return CollationAwareUTF8String.lowercaseSubStringIndex(string, delimiter, count);

Review Comment:
   agreed, we should do this in https://github.com/apache/spark/pull/45820



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47566][SQL] Support SUBSTRING_INDEX function to work with collated strings [spark]

Posted by "stefankandic (via GitHub)" <gi...@apache.org>.

stefankandic commented on code in PR #45725:
URL: https://github.com/apache/spark/pull/45725#discussion_r1544730418


##########
sql/core/src/test/scala/org/apache/spark/sql/CollationStringExpressionsSuite.scala:
##########
@@ -73,6 +73,84 @@ class CollationStringExpressionsSuite extends QueryTest
     })
   }
 
+  test("SUBSTRING_INDEX check result on explicitly collated strings") {
+    def testSubstringIndex(str: String, delim: String, cnt: Integer,
+                           collationId: Integer, expected: String): Unit = {
+      val string = Literal.create(str, StringType(collationId))
+      val delimiter = Literal.create(delim, StringType(collationId))
+      val count = Literal(cnt)
+
+      checkEvaluation(SubstringIndex(string, delimiter, count), expected)
+    }
+
+    testSubstringIndex("wwwgapachegorg", "g", -3, 0, "apachegorg")
+    testSubstringIndex("www||apache||org", "||", 2, 0, "www||apache")
+    // UTF8_BINARY_LCASE

Review Comment:
   instead of separating these blocks with comments describing the collation used how about using a variable denoting the current collation
   ```
   var collationId = CollationFactory.nameToId("UTF8_BINARY")
   testcases ...
   
   collationId = CollationFactory.nameToId("UNICODE")
   testcases ...
   
   ```
   
   or even we can do it separately for each
   ```
   val utfLcaseId = CollationFacotry.nameToId(...) 
   ```
   
   this way we won't have to litter the code with [magic constants](https://en.wikipedia.org/wiki/Magic_number_(programming)) which should make it more readable



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47566][SQL] Support SUBSTRING_INDEX function to work with collated strings [spark]

Posted by "stefankandic (via GitHub)" <gi...@apache.org>.

stefankandic commented on code in PR #45725:
URL: https://github.com/apache/spark/pull/45725#discussion_r1544734286


##########
sql/core/src/test/scala/org/apache/spark/sql/CollationStringExpressionsSuite.scala:
##########
@@ -73,6 +73,84 @@ class CollationStringExpressionsSuite extends QueryTest
     })
   }
 
+  test("SUBSTRING_INDEX check result on explicitly collated strings") {
+    def testSubstringIndex(str: String, delim: String, cnt: Integer,
+                           collationId: Integer, expected: String): Unit = {
+      val string = Literal.create(str, StringType(collationId))
+      val delimiter = Literal.create(delim, StringType(collationId))
+      val count = Literal(cnt)
+
+      checkEvaluation(SubstringIndex(string, delimiter, count), expected)
+    }
+
+    testSubstringIndex("wwwgapachegorg", "g", -3, 0, "apachegorg")
+    testSubstringIndex("www||apache||org", "||", 2, 0, "www||apache")
+    // UTF8_BINARY_LCASE

Review Comment:
   also we shouldn't treat these values as never changing and should always refer to collations by their name rather than their id



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47566][SQL] Support SUBSTRING_INDEX function to work with collated strings [spark]

Posted by "uros-db (via GitHub)" <gi...@apache.org>.

uros-db commented on code in PR #45725:
URL: https://github.com/apache/spark/pull/45725#discussion_r1546515387


##########
common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java:
##########
@@ -913,6 +1028,50 @@ public UTF8String subStringIndex(UTF8String delim, int count) {
     }
   }
 
+  public UTF8String collatedSubStringIndex(UTF8String delim, int count, int collationId) {

Review Comment:
   Let's go with this naming: `collationAwareSubStringIndex` for these private functions that take a collationId
   
   ref: https://github.com/apache/spark/pull/45749#discussion_r1546421174



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47566][SQL] Support SUBSTRING_INDEX function to work with collated strings [spark]

Posted by "stefankandic (via GitHub)" <gi...@apache.org>.

stefankandic commented on code in PR #45725:
URL: https://github.com/apache/spark/pull/45725#discussion_r1549692065


##########
common/unsafe/src/test/java/org/apache/spark/unsafe/types/UTF8StringWithCollationSuite.java:
##########
@@ -19,7 +19,6 @@
 import org.apache.spark.SparkException;
 import org.apache.spark.sql.catalyst.util.CollationFactory;
 import org.junit.jupiter.api.Test;
-

Review Comment:
   remove this unrelated change please



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47566][SQL] Support SUBSTRING_INDEX function to work with collated strings [spark]

Posted by "miland-db (via GitHub)" <gi...@apache.org>.

miland-db commented on code in PR #45725:
URL: https://github.com/apache/spark/pull/45725#discussion_r1546059961


##########
common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java:
##########
@@ -863,6 +914,70 @@ private int rfind(UTF8String str, int start) {
     return -1;
   }
 
+  /**
+   * Find the `str` from right to left considering different collations.
+   */
+  private int rfind(UTF8String str, int start, int collationId) {
+    if (CollationFactory.fetchCollation(collationId).supportsBinaryEquality) {
+      return this.rfind(str, start);
+    }
+    if(collationId == CollationFactory.UTF8_BINARY_LCASE_COLLATION_ID) {
+      return rfindLowercase(str, start);
+    }
+    return collatedRFind(str, start, collationId);
+  }
+
+  /**
+   * Find the `str` from left to right considering binary lowercase collation.
+   */
+  private int rfindLowercase(UTF8String str, int start) {
+    if(numBytes == 0 || str.numBytes == 0) {

Review Comment:
   Changed to original behavior to fail if str.numBytes == 0.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org