You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by "miland-db (via GitHub)" <gi...@apache.org> on 2024/03/25 15:00:12 UTC

[PR] [SPARK-47476][SQL][COLLATION] String function support: replace [spark]

miland-db opened a new pull request, #45704:
URL: https://github.com/apache/spark/pull/45704

   ### What changes were proposed in this pull request?
   Extend built-in string functions to support non-binary, non-lowercase collation for: replace.
   
   ### Why are the changes needed?
   Update collation support for built-in string functions in Spark.
   
   ### Does this PR introduce _any_ user-facing change?
   Yes, users should now be able to use COLLATE within arguments for built-in string function REPLACE in Spark SQL queries, using non-binary collations such as UNICODE_CI.
   
   ### How was this patch tested?
   Unit tests for queries using StringReplace (`CollationStringExpressionsSuite.scala`).
   
   ### Was this patch authored or co-authored using generative AI tooling?
   No


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47476][SQL] Support REPLACE function to work with collated strings [spark]

Posted by "uros-db (via GitHub)" <gi...@apache.org>.
uros-db commented on code in PR #45704:
URL: https://github.com/apache/spark/pull/45704#discussion_r1546513659


##########
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationFactory.java:
##########
@@ -179,12 +179,26 @@ public static StringSearch getStringSearch(
       final UTF8String left,
       final UTF8String right,
       final int collationId) {
+
+    if(collationId == UTF8_BINARY_LCASE_COLLATION_ID) {
+      return getStringSearchUTF8LCase(left, right);
+    }
+
     String pattern = right.toString();
     CharacterIterator target = new StringCharacterIterator(left.toString());
     Collator collator = CollationFactory.fetchCollation(collationId).collator;
     return new StringSearch(pattern, target, (RuleBasedCollator) collator);
   }
 
+  private static StringSearch getStringSearchUTF8LCase(

Review Comment:
   I think we can name this: `getStringSearch`
   absence of collationId will indicate that no Collator is specified for the corresponding StringSearch instance



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47476][SQL] Support REPLACE function to work with collated strings [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on PR #45704:
URL: https://github.com/apache/spark/pull/45704#issuecomment-2079601554

   the Spark Connect test failure is flaky and unrelated here, I'm merging it to master, thanks!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47476][SQL] Support REPLACE function to work with collated strings [spark]

Posted by "mihailom-db (via GitHub)" <gi...@apache.org>.
mihailom-db commented on code in PR #45704:
URL: https://github.com/apache/spark/pull/45704#discussion_r1550983282


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala:
##########
@@ -735,23 +735,43 @@ case class EndsWith(left: Expression, right: Expression) extends StringPredicate
 case class StringReplace(srcExpr: Expression, searchExpr: Expression, replaceExpr: Expression)
   extends TernaryExpression with ImplicitCastInputTypes with NullIntolerant {
 
+  final lazy val collationId: Int = first.dataType.asInstanceOf[StringType].collationId
+
   def this(srcExpr: Expression, searchExpr: Expression) = {
     this(srcExpr, searchExpr, Literal(""))
   }
 
   override def nullSafeEval(srcEval: Any, searchEval: Any, replaceEval: Any): Any = {
     srcEval.asInstanceOf[UTF8String].replace(
-      searchEval.asInstanceOf[UTF8String], replaceEval.asInstanceOf[UTF8String])
+      searchEval.asInstanceOf[UTF8String], replaceEval.asInstanceOf[UTF8String], collationId)
   }
 
   override def doGenCode(ctx: CodegenContext, ev: ExprCode): ExprCode = {
-    nullSafeCodeGen(ctx, ev, (src, search, replace) => {
-      s"""${ev.value} = $src.replace($search, $replace);"""
-    })
+    if (CollationFactory.fetchCollation(collationId).supportsBinaryEquality) {
+      nullSafeCodeGen(ctx, ev, (src, search, replace) => {
+        s"""${ev.value} = $src.replace($search, $replace);"""
+      })
+    } else {
+      nullSafeCodeGen(ctx, ev, (src, search, replace) => {
+        s"""${ev.value} = $src.replace($search, $replace, $collationId);"""
+      })
+    }
   }
 
-  override def dataType: DataType = StringType
-  override def inputTypes: Seq[DataType] = Seq(StringType, StringType, StringType)
+  override def checkInputDataTypes(): TypeCheckResult = {
+    val defaultCheck = super.checkInputDataTypes()
+    if (defaultCheck.isFailure) {
+      return defaultCheck
+    }
+
+    // Only srcExpr and searchExpr are checked for collation compatibility.
+    val collationId = first.dataType.asInstanceOf[StringType].collationId
+    CollationTypeConstraints.checkCollationCompatibility(collationId, Seq(second.dataType))

Review Comment:
   +1, and make sure you add stringReplace to CollationTypeCasts transform method. Since all inputs are Strings, you can opt for default case.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47476][SQL] Support REPLACE function to work with collated strings [spark]

Posted by "uros-db (via GitHub)" <gi...@apache.org>.
uros-db commented on code in PR #45704:
URL: https://github.com/apache/spark/pull/45704#discussion_r1549420147


##########
common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java:
##########
@@ -1136,6 +1136,56 @@ public UTF8String replace(UTF8String search, UTF8String replace) {
     return buf.build();
   }
 
+  public UTF8String replace(UTF8String search, UTF8String replace, int collationId) {

Review Comment:
   I think we now have a lot of new methods 
   consider adding comments such as: https://github.com/apache/spark/pull/45791/files#r1549415554



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47476][SQL][COLLATION] String function support: replace [spark]

Posted by "mihailom-db (via GitHub)" <gi...@apache.org>.
mihailom-db commented on code in PR #45704:
URL: https://github.com/apache/spark/pull/45704#discussion_r1537909524


##########
sql/core/src/test/scala/org/apache/spark/sql/CollationStringExpressionsSuite.scala:
##########
@@ -70,6 +73,46 @@ class CollationStringExpressionsSuite extends QueryTest with SharedSparkSession
     })
   }
 
+  test("REPLACE check result on explicitly collated strings") {
+    def testReplace(expected: String, collationId: Integer,
+                    source: String, search: String, replace: String): Unit = {

Review Comment:
   nit, ordering of parameters seem a bit non-intuitive (expected before inputs)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47476][SQL] String function support: replace [spark]

Posted by "dongjoon-hyun (via GitHub)" <gi...@apache.org>.
dongjoon-hyun commented on PR #45704:
URL: https://github.com/apache/spark/pull/45704#issuecomment-2018464528

   > SQL tag is sufficient, but I don't mind people adding more grouping if the number of PRs is large enough.
   
   Thank you, @cloud-fan . Then, let's not use this. I don't think this is a permanent grouping.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47476][SQL] Support REPLACE function to work with collated strings [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan closed pull request #45704: [SPARK-47476][SQL] Support REPLACE function to work with collated strings
URL: https://github.com/apache/spark/pull/45704


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47476][SQL] Support REPLACE function to work with collated strings [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on code in PR #45704:
URL: https://github.com/apache/spark/pull/45704#discussion_r1538624688


##########
common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java:
##########
@@ -1136,6 +1136,104 @@ public UTF8String replace(UTF8String search, UTF8String replace) {
     return buf.build();
   }
 
+  public UTF8String replace(UTF8String search, UTF8String replace, int collationId) {
+    if (CollationFactory.fetchCollation(collationId).isBinaryCollation) {
+      return this.replace(search, replace);
+    }
+    if (collationId == CollationFactory.LOWERCASE_COLLATION_ID) {
+      return lowercaseReplace(search, replace);
+    }
+    return collatedReplace(search, replace, collationId);
+  }
+
+  public UTF8String lowercaseReplace(UTF8String search, UTF8String replace) {

Review Comment:
   is it faster than `collatedReplace`?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47476][SQL] Support REPLACE function to work with collated strings [spark]

Posted by "uros-db (via GitHub)" <gi...@apache.org>.
uros-db commented on code in PR #45704:
URL: https://github.com/apache/spark/pull/45704#discussion_r1546514048


##########
common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java:
##########
@@ -1136,6 +1136,57 @@ public UTF8String replace(UTF8String search, UTF8String replace) {
     return buf.build();
   }
 
+  public UTF8String replace(UTF8String search, UTF8String replace, int collationId) {
+    if (CollationFactory.fetchCollation(collationId).supportsBinaryEquality) {
+      return this.replace(search, replace);
+    }
+    return collatedReplace(search, replace, collationId);
+  }
+
+  private UTF8String collatedReplace(UTF8String search, UTF8String replace, int collationId) {

Review Comment:
   Let's go with this naming: `collationAwareReplace` for these private functions that take a collationId
   
   ref: https://github.com/apache/spark/pull/45749#discussion_r1546421174



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47476][SQL] Support REPLACE function to work with collated strings [spark]

Posted by "uros-db (via GitHub)" <gi...@apache.org>.
uros-db commented on code in PR #45704:
URL: https://github.com/apache/spark/pull/45704#discussion_r1550968988


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala:
##########
@@ -735,23 +735,43 @@ case class EndsWith(left: Expression, right: Expression) extends StringPredicate
 case class StringReplace(srcExpr: Expression, searchExpr: Expression, replaceExpr: Expression)
   extends TernaryExpression with ImplicitCastInputTypes with NullIntolerant {
 
+  final lazy val collationId: Int = first.dataType.asInstanceOf[StringType].collationId
+
   def this(srcExpr: Expression, searchExpr: Expression) = {
     this(srcExpr, searchExpr, Literal(""))
   }
 
   override def nullSafeEval(srcEval: Any, searchEval: Any, replaceEval: Any): Any = {
     srcEval.asInstanceOf[UTF8String].replace(
-      searchEval.asInstanceOf[UTF8String], replaceEval.asInstanceOf[UTF8String])
+      searchEval.asInstanceOf[UTF8String], replaceEval.asInstanceOf[UTF8String], collationId)
   }
 
   override def doGenCode(ctx: CodegenContext, ev: ExprCode): ExprCode = {
-    nullSafeCodeGen(ctx, ev, (src, search, replace) => {
-      s"""${ev.value} = $src.replace($search, $replace);"""
-    })
+    if (CollationFactory.fetchCollation(collationId).supportsBinaryEquality) {
+      nullSafeCodeGen(ctx, ev, (src, search, replace) => {
+        s"""${ev.value} = $src.replace($search, $replace);"""
+      })
+    } else {
+      nullSafeCodeGen(ctx, ev, (src, search, replace) => {
+        s"""${ev.value} = $src.replace($search, $replace, $collationId);"""
+      })
+    }
   }
 
-  override def dataType: DataType = StringType
-  override def inputTypes: Seq[DataType] = Seq(StringType, StringType, StringType)
+  override def checkInputDataTypes(): TypeCheckResult = {
+    val defaultCheck = super.checkInputDataTypes()
+    if (defaultCheck.isFailure) {
+      return defaultCheck
+    }
+
+    // Only srcExpr and searchExpr are checked for collation compatibility.
+    val collationId = first.dataType.asInstanceOf[StringType].collationId
+    CollationTypeConstraints.checkCollationCompatibility(collationId, Seq(second.dataType))

Review Comment:
   heads up: [implicit cast](https://github.com/apache/spark/pull/45383) is now live
   please sync your fork and rebase your changes accordingly



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47476][SQL] Support REPLACE function to work with collated strings [spark]

Posted by "miland-db (via GitHub)" <gi...@apache.org>.
miland-db commented on code in PR #45704:
URL: https://github.com/apache/spark/pull/45704#discussion_r1539225999


##########
common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java:
##########
@@ -1136,6 +1136,104 @@ public UTF8String replace(UTF8String search, UTF8String replace) {
     return buf.build();
   }
 
+  public UTF8String replace(UTF8String search, UTF8String replace, int collationId) {
+    if (CollationFactory.fetchCollation(collationId).isBinaryCollation) {
+      return this.replace(search, replace);
+    }
+    if (collationId == CollationFactory.LOWERCASE_COLLATION_ID) {
+      return lowercaseReplace(search, replace);
+    }
+    return collatedReplace(search, replace, collationId);
+  }
+
+  public UTF8String lowercaseReplace(UTF8String search, UTF8String replace) {

Review Comment:
   I think @dbatomic knows better than me if we can/should do that.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47476][SQL] Support REPLACE function to work with collated strings [spark]

Posted by "uros-db (via GitHub)" <gi...@apache.org>.
uros-db commented on code in PR #45704:
URL: https://github.com/apache/spark/pull/45704#discussion_r1546263792


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala:
##########
@@ -740,18 +740,39 @@ case class StringReplace(srcExpr: Expression, searchExpr: Expression, replaceExp
   }
 
   override def nullSafeEval(srcEval: Any, searchEval: Any, replaceEval: Any): Any = {
+    val collationId = first.dataType.asInstanceOf[StringType].collationId

Review Comment:
   ditto



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47476][SQL] Support REPLACE function to work with collated strings [spark]

Posted by "uros-db (via GitHub)" <gi...@apache.org>.
uros-db commented on code in PR #45704:
URL: https://github.com/apache/spark/pull/45704#discussion_r1547414846


##########
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationFactory.java:
##########
@@ -179,12 +179,26 @@ public static StringSearch getStringSearch(
       final UTF8String left,
       final UTF8String right,
       final int collationId) {
+
+    if(collationId == UTF8_BINARY_LCASE_COLLATION_ID) {
+      return getStringSearchUTF8LCase(left, right);
+    }
+
     String pattern = right.toString();
     CharacterIterator target = new StringCharacterIterator(left.toString());
     Collator collator = CollationFactory.fetchCollation(collationId).collator;
     return new StringSearch(pattern, target, (RuleBasedCollator) collator);
   }
 
+  private static StringSearch getStringSearchUTF8LCase(

Review Comment:
   I would prefer:
   ```
   public static StringSearch getStringSearch(
             final UTF8String patternUTF8String,
             final UTF8String targetUTF8String) {
       return new StringSearch(patternUTF8String.toString(), targetUTF8String.toString());
     }
   ```
   
   and then:
   `getStringSearch(left.toLowerCase(), right.toLowerCase())` on line 184
   
   this version of "getStringSearch" may come in handy in the future, so I think we can do this straight away



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47476][SQL] Support REPLACE function to work with collated strings [spark]

Posted by "miland-db (via GitHub)" <gi...@apache.org>.
miland-db commented on code in PR #45704:
URL: https://github.com/apache/spark/pull/45704#discussion_r1538955710


##########
common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java:
##########
@@ -1136,6 +1136,104 @@ public UTF8String replace(UTF8String search, UTF8String replace) {
     return buf.build();
   }
 
+  public UTF8String replace(UTF8String search, UTF8String replace, int collationId) {
+    if (CollationFactory.fetchCollation(collationId).isBinaryCollation) {
+      return this.replace(search, replace);
+    }
+    if (collationId == CollationFactory.LOWERCASE_COLLATION_ID) {
+      return lowercaseReplace(search, replace);
+    }
+    return collatedReplace(search, replace, collationId);
+  }
+
+  public UTF8String lowercaseReplace(UTF8String search, UTF8String replace) {

Review Comment:
   It is **not** faster. It is a separate method because 
   `StringSearch stringSearch = CollationFactory.getStringSearch(this, search, collationId);` fails for `collationId == 1` because _collator_ for that value is _null_ (see `CollationFactory`). For that reason I had to do comparisons the standard way we do it for `UTF8_BINARY_LCASE` - convert everything to lowercase and compare it that way. 
   
   So the idea is to do comparisons on lowercase strings but use the original string for building the final result, to keep all uppercase and lowercase as in the input.
   
   With other type of collations we can do all comparisons on the original input values.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47476][SQL] Support REPLACE function to work with collated strings [spark]

Posted by "uros-db (via GitHub)" <gi...@apache.org>.
uros-db commented on code in PR #45704:
URL: https://github.com/apache/spark/pull/45704#discussion_r1546263160


##########
common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java:
##########
@@ -1136,6 +1136,104 @@ public UTF8String replace(UTF8String search, UTF8String replace) {
     return buf.build();
   }
 
+  public UTF8String replace(UTF8String search, UTF8String replace, int collationId) {
+    if (CollationFactory.fetchCollation(collationId).isBinaryCollation) {
+      return this.replace(search, replace);
+    }
+    if (collationId == CollationFactory.LOWERCASE_COLLATION_ID) {
+      return lowercaseReplace(search, replace);
+    }
+    return collatedReplace(search, replace, collationId);
+  }
+
+  public UTF8String lowercaseReplace(UTF8String search, UTF8String replace) {

Review Comment:
   I think there is simply no `com.ibm.icu.text.Collator` instance for **UTF8_BINARY_LCASE**. After all, UTF8String is a Spark concept, and while I suppose that we could introduce some kind of wrapper collator for UTF8_BINARY_LCASE, it wouldn't seem like a good idea to me
   
   Instead, I think there should be an easy way to get a `com.ibm.icu.text.StringSearch` instance without passing a collationId (CollationFactory currently doesn't support this, but @miland-db can modify it in a way that would allow optional collationId: `new StringSearch(pattern, target)`)
   
   After these changes, I believe `lowercaseReplace` and `collatedReplace` can essentially be combined into a single function, where for **UTF8_BINARY_LCASE** it should be enough to pass `.toLowerCase()` for both **this** & **search** parameters (with no collationId)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47476][SQL] Support REPLACE function to work with collated strings [spark]

Posted by "uros-db (via GitHub)" <gi...@apache.org>.
uros-db commented on code in PR #45704:
URL: https://github.com/apache/spark/pull/45704#discussion_r1547414846


##########
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationFactory.java:
##########
@@ -179,12 +179,26 @@ public static StringSearch getStringSearch(
       final UTF8String left,
       final UTF8String right,
       final int collationId) {
+
+    if(collationId == UTF8_BINARY_LCASE_COLLATION_ID) {
+      return getStringSearchUTF8LCase(left, right);
+    }
+
     String pattern = right.toString();
     CharacterIterator target = new StringCharacterIterator(left.toString());
     Collator collator = CollationFactory.fetchCollation(collationId).collator;
     return new StringSearch(pattern, target, (RuleBasedCollator) collator);
   }
 
+  private static StringSearch getStringSearchUTF8LCase(

Review Comment:
   I would prefer:
   ```
   public static StringSearch getStringSearch(final UTF8String pattern, final UTF8String target) {
       return new StringSearch(pattern.toString(), target.toString());
     }
   ```
   
   and then:
   `getStringSearch(left.toLowerCase(), right.toLowerCase())` on line 184
   
   this version of "getStringSearch" may come in handy in the future, so I think we can do this straight away



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47476][SQL] Support REPLACE function to work with collated strings [spark]

Posted by "uros-db (via GitHub)" <gi...@apache.org>.
uros-db commented on code in PR #45704:
URL: https://github.com/apache/spark/pull/45704#discussion_r1546514048


##########
common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java:
##########
@@ -1136,6 +1136,57 @@ public UTF8String replace(UTF8String search, UTF8String replace) {
     return buf.build();
   }
 
+  public UTF8String replace(UTF8String search, UTF8String replace, int collationId) {
+    if (CollationFactory.fetchCollation(collationId).supportsBinaryEquality) {
+      return this.replace(search, replace);
+    }
+    return collatedReplace(search, replace, collationId);
+  }
+
+  private UTF8String collatedReplace(UTF8String search, UTF8String replace, int collationId) {

Review Comment:
   Let's go with this naming: `collationAwareReplace`



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47476][SQL][COLLATION] String function support: replace [spark]

Posted by "miland-db (via GitHub)" <gi...@apache.org>.
miland-db commented on PR #45704:
URL: https://github.com/apache/spark/pull/45704#issuecomment-2018330828

   @mihailom-db @uros-db @dbatomic


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47476][SQL][COLLATION] String function support: replace [spark]

Posted by "dongjoon-hyun (via GitHub)" <gi...@apache.org>.
dongjoon-hyun commented on PR #45704:
URL: https://github.com/apache/spark/pull/45704#issuecomment-2018373696

   Hi, @miland-db and @cloud-fan . I saw a series of `[COLLATION]`. Are you going to make `[COLLATION]` as the official community module tag in the PR title? For me, `SQL` seems to be enough for SQL module PR.
   
   ```
   $ git log --oneline | grep COLLATION
   8762e256d16 [SPARK-47296][SQL][COLLATION] Fail unsupported functions for non-binary collations
   456d246badb [SPARK-47248][SQL][COLLATION] Improved string function support: contains
   d5f35ec97fc [SPARK-46835][SQL][COLLATIONS] Join support for non-binary collations
   6534a3398ae [SPARK-47102][SQL] Add the `COLLATION_ENABLED` config flag
   ca7c60b4998 [SPARK-47268][SQL][COLLATIONS] Support for repartition with collations
   479954cf73a [SPARK-47131][SQL][COLLATION] String function support: contains, startswith, endswith
   71767cfe376 [SPARK-46834][SQL][COLLATIONS] Support for aggregates
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47476][SQL] Support REPLACE function to work with collated strings [spark]

Posted by "miland-db (via GitHub)" <gi...@apache.org>.
miland-db commented on code in PR #45704:
URL: https://github.com/apache/spark/pull/45704#discussion_r1546329999


##########
common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java:
##########
@@ -1136,6 +1136,104 @@ public UTF8String replace(UTF8String search, UTF8String replace) {
     return buf.build();
   }
 
+  public UTF8String replace(UTF8String search, UTF8String replace, int collationId) {
+    if (CollationFactory.fetchCollation(collationId).isBinaryCollation) {
+      return this.replace(search, replace);
+    }
+    if (collationId == CollationFactory.LOWERCASE_COLLATION_ID) {
+      return lowercaseReplace(search, replace);
+    }
+    return collatedReplace(search, replace, collationId);
+  }
+
+  public UTF8String lowercaseReplace(UTF8String search, UTF8String replace) {

Review Comment:
   I managed to refactor this part of the code and remove `lowercaseReplace`. Now we have getStringSearch method that works with `UTF8_BINARY_LCASE` collation.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47476][SQL] Support REPLACE function to work with collated strings [spark]

Posted by "uros-db (via GitHub)" <gi...@apache.org>.
uros-db commented on code in PR #45704:
URL: https://github.com/apache/spark/pull/45704#discussion_r1547414846


##########
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationFactory.java:
##########
@@ -179,12 +179,26 @@ public static StringSearch getStringSearch(
       final UTF8String left,
       final UTF8String right,
       final int collationId) {
+
+    if(collationId == UTF8_BINARY_LCASE_COLLATION_ID) {
+      return getStringSearchUTF8LCase(left, right);
+    }
+
     String pattern = right.toString();
     CharacterIterator target = new StringCharacterIterator(left.toString());
     Collator collator = CollationFactory.fetchCollation(collationId).collator;
     return new StringSearch(pattern, target, (RuleBasedCollator) collator);
   }
 
+  private static StringSearch getStringSearchUTF8LCase(

Review Comment:
   I would prefer:
   ```
   public static StringSearch getStringSearch(final UTF8String pattern, final UTF8String target) {
       return new StringSearch(pattern.toString(), target.toString());
     }
   ```
   
   and then:
   `getStringSearch(left, right)` on line 184
   
   this version of "getStringSearch" may come in handy in the future, so I think we can do this straight away



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47476][SQL] Support REPLACE function to work with collated strings [spark]

Posted by "uros-db (via GitHub)" <gi...@apache.org>.
uros-db commented on PR #45704:
URL: https://github.com/apache/spark/pull/45704#issuecomment-2049812864

   heads up: we’ve done some major code restructuring in https://github.com/apache/spark/pull/45978, so please sync these changes before moving on
   
   @miland-db you’ll likely need to rewrite the code in this PR, so please follow the guidelines outlined in https://issues.apache.org/jira/browse/SPARK-47410


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47476][SQL][COLLATION] String function support: replace [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on PR #45704:
URL: https://github.com/apache/spark/pull/45704#issuecomment-2018429541

   SQL tag is sufficient, but I don't mind people adding more grouping if the number of PRs is large enough.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47476][SQL] Support REPLACE function to work with collated strings [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on code in PR #45704:
URL: https://github.com/apache/spark/pull/45704#discussion_r1539215853


##########
common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java:
##########
@@ -1136,6 +1136,104 @@ public UTF8String replace(UTF8String search, UTF8String replace) {
     return buf.build();
   }
 
+  public UTF8String replace(UTF8String search, UTF8String replace, int collationId) {
+    if (CollationFactory.fetchCollation(collationId).isBinaryCollation) {
+      return this.replace(search, replace);
+    }
+    if (collationId == CollationFactory.LOWERCASE_COLLATION_ID) {
+      return lowercaseReplace(search, replace);
+    }
+    return collatedReplace(search, replace, collationId);
+  }
+
+  public UTF8String lowercaseReplace(UTF8String search, UTF8String replace) {

Review Comment:
   Can we create a `CollationFactory` for the collation with id 1?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47476][SQL] Support REPLACE function to work with collated strings [spark]

Posted by "miland-db (via GitHub)" <gi...@apache.org>.
miland-db commented on code in PR #45704:
URL: https://github.com/apache/spark/pull/45704#discussion_r1547378344


##########
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationFactory.java:
##########
@@ -179,12 +179,26 @@ public static StringSearch getStringSearch(
       final UTF8String left,
       final UTF8String right,
       final int collationId) {
+
+    if(collationId == UTF8_BINARY_LCASE_COLLATION_ID) {
+      return getStringSearchUTF8LCase(left, right);
+    }
+
     String pattern = right.toString();
     CharacterIterator target = new StringCharacterIterator(left.toString());
     Collator collator = CollationFactory.fetchCollation(collationId).collator;
     return new StringSearch(pattern, target, (RuleBasedCollator) collator);
   }
 
+  private static StringSearch getStringSearchUTF8LCase(

Review Comment:
   This is private method and is just being called from `getStringSearch` for `collationId == 1`. It is a question if we should check for UTF8_BINARY_LCASE in the methods calling `getStringSearch` or if we should adapt original `getStringSearch` method to solve this problem.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47476][SQL] Support REPLACE function to work with collated strings [spark]

Posted by "miland-db (via GitHub)" <gi...@apache.org>.
miland-db commented on code in PR #45704:
URL: https://github.com/apache/spark/pull/45704#discussion_r1549660968


##########
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationFactory.java:
##########
@@ -179,12 +179,26 @@ public static StringSearch getStringSearch(
       final UTF8String left,
       final UTF8String right,
       final int collationId) {
+
+    if(collationId == UTF8_BINARY_LCASE_COLLATION_ID) {
+      return getStringSearchUTF8LCase(left, right);
+    }
+
     String pattern = right.toString();
     CharacterIterator target = new StringCharacterIterator(left.toString());
     Collator collator = CollationFactory.fetchCollation(collationId).collator;
     return new StringSearch(pattern, target, (RuleBasedCollator) collator);
   }
 
+  private static StringSearch getStringSearchUTF8LCase(

Review Comment:
   I did it this way now.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org