You are viewing a plain text version of this content. The canonical link for it is here.

Posted to reviews@spark.apache.org by "davidm-db (via GitHub)" <gi...@apache.org> on 2024/03/28 09:56:20 UTC

[PR] [SPARK-47409][SQL] Add support for collation for StringTrim type of functions/expressions [spark]

davidm-db opened a new pull request, #45749:
URL: https://github.com/apache/spark/pull/45749

### Left to do/discussion topics (to be removed by the PR completion)
- Tests (in progress).
- Function doc comments (to be added).
- Handling of a lowercase (`utf8_binary_lcase`) collation type seems a bit grueling in the long run. This PR handles it the same way as in every other place, since I think it is out of the scope of this PR, but should we maybe create a work item to generalize this collation type better (if we don't have it already)?

### What changes were proposed in this pull request?
This PR is created to add support for collations to StringTrim family of functions/expressions, specifically:
- `StringTrim`
- `StringTrimBoth`
- `StringTrimLeft`
- `StringTrimRight`

Changes:
- `stringExpressions.scala`
- Check input types to be of the same collation.
- Route to proper endpoints in `UTF8String` - `doEval` and `doGenCode`.
- 'UTF8String.java`
- Implement logic for all trim functions (logic should be well commented-out in the code and mostly matches already existing implementations of `trimLeft` and `trimRight`).

Other minor changes in this PR:
- Change parameter names in `CollationFactory.getStringSearch()` to a more descriptive ones.
- Spacing and indentation "fixes" in various related parts of the code.

### Why are the changes needed?
We are incrementally adding collation support to a built-in string functions in Spark.

### Does this PR introduce _any_ user-facing change?
Yes:
- User should now be able to use non-default collations in string trim functions.
- User will receive `DATATYPE_MISMATCH.COLLATION_MISMATCH` exception if collations of two function parameters do not match.

### How was this patch tested?
Already existing tests + new unit/e2e tests.

### Was this patch authored or co-authored using generative AI tooling?
No.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47409][SQL] Add support for collation for StringTrim type of functions/expressions [spark]

Posted by "davidm-db (via GitHub)" <gi...@apache.org>.

davidm-db commented on code in PR #45749:
URL: https://github.com/apache/spark/pull/45749#discussion_r1542675474


##########
common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java:
##########
@@ -733,27 +873,30 @@ public UTF8String trimRight(UTF8String trimString) {
     int[] stringCharLen = new int[numBytes];
     // array of the first byte position for each character in the source string
     int[] stringCharPos = new int[numBytes];
+
     // build the position and length array
     while (charIdx < numBytes) {
       stringCharPos[numChars] = charIdx;
       stringCharLen[numChars] = numBytesForFirstByte(getByte(charIdx));
       charIdx += stringCharLen[numChars];
-      numChars ++;
+      numChars++;
     }
 
     // index trimEnd points to the first no matching byte position from the right side of
     // the source string.
     int trimEnd = numBytes - 1;
+
     while (numChars > 0) {
       UTF8String searchChar = copyUTF8String(
           stringCharPos[numChars - 1],
           stringCharPos[numChars - 1] + stringCharLen[numChars - 1] - 1);
+
       if (trimString.find(searchChar, 0) >= 0) {
         trimEnd -= stringCharLen[numChars - 1];
       } else {
         break;
       }
-      numChars --;
+      numChars--;

Review Comment:
   Wasn't aware that this might be the issue, thanks!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47409][SQL] Add support for collation for StringTrim type of functions/expressions [spark]

Posted by "uros-db (via GitHub)" <gi...@apache.org>.

uros-db commented on code in PR #45749:
URL: https://github.com/apache/spark/pull/45749#discussion_r1546516509


##########
common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java:
##########
@@ -628,6 +646,27 @@ public UTF8String trim(UTF8String trimString) {
     }
   }
 
+  /**
+   * Trims characters of the given trim string from both ends of this string.
+   * This variant of the method additionally applies provided collation to this string
+   * and trim characters before searching.
+   *
+   * @param trimString The trim characters string.
+   * @param collationId Id of the collation to use.
+   * @return this string with no occurrences of the characters from trim string.
+   */
+  public UTF8String trim(UTF8String trimString, int collationId) {

Review Comment:
   Let's go with this naming: `collationAwareTrim`
   
   ref: https://github.com/apache/spark/pull/45749#discussion_r1546421174



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47409][SQL] Add support for collation for StringTrim type of functions/expressions [spark]

Posted by "davidm-db (via GitHub)" <gi...@apache.org>.

davidm-db commented on PR #45749:
URL: https://github.com/apache/spark/pull/45749#issuecomment-2027101103

   @MaxGekk, @cloud-fan could someone please take a look at this PR?
   
   Tagging the rest of Belgrade collation crew if anyone would like to review additionally: @dbatomic, @nikolamand-db, @stefankandic, @uros-db 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47409][SQL] Add support for collation for StringTrim type of functions/expressions [spark]

Posted by "uros-db (via GitHub)" <gi...@apache.org>.

uros-db commented on code in PR #45749:
URL: https://github.com/apache/spark/pull/45749#discussion_r1578907085


##########
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationFactory.java:
##########
@@ -175,12 +175,19 @@ public Collation(
    * Auxiliary methods for collation aware string operations.
    */
 
+  /**
+   * Creates an instance of ICU's StringSearch with provided parameters.
+   * @param targetUTF8String UTF8String representation of the string to be searched.
+   * @param patternUTF8String UTF8String representation of the string to search for.
+   * @param collationId ID of the collation to use.
+   * @return Created instance of StringSearch.
+   */
   public static StringSearch getStringSearch(
-      final UTF8String left,
-      final UTF8String right,
+      final UTF8String targetUTF8String,
+      final UTF8String patternUTF8String,
       final int collationId) {
-    String pattern = right.toString();
-    CharacterIterator target = new StringCharacterIterator(left.toString());
+    String pattern = patternUTF8String.toString();
+    CharacterIterator target = new StringCharacterIterator(targetUTF8String.toString());
     Collator collator = CollationFactory.fetchCollation(collationId).collator;
     return new StringSearch(pattern, target, (RuleBasedCollator) collator);
   }

Review Comment:
   did you pull the latest changes on master? we now have a different version of stringSearch



##########
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationFactory.java:
##########
@@ -175,12 +175,19 @@ public Collation(
    * Auxiliary methods for collation aware string operations.
    */
 
+  /**
+   * Creates an instance of ICU's StringSearch with provided parameters.
+   * @param targetUTF8String UTF8String representation of the string to be searched.
+   * @param patternUTF8String UTF8String representation of the string to search for.
+   * @param collationId ID of the collation to use.
+   * @return Created instance of StringSearch.
+   */
   public static StringSearch getStringSearch(
-      final UTF8String left,
-      final UTF8String right,
+      final UTF8String targetUTF8String,
+      final UTF8String patternUTF8String,
       final int collationId) {
-    String pattern = right.toString();
-    CharacterIterator target = new StringCharacterIterator(left.toString());
+    String pattern = patternUTF8String.toString();
+    CharacterIterator target = new StringCharacterIterator(targetUTF8String.toString());
     Collator collator = CollationFactory.fetchCollation(collationId).collator;
     return new StringSearch(pattern, target, (RuleBasedCollator) collator);
   }

Review Comment:
   https://github.com/apache/spark/blob/b4624bf4be28974e8df175670f89cf96858b7b81/common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationFactory.java#L195



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47409][SQL] Add support for collation for StringTrim type of functions/expressions [spark]

Posted by "uros-db (via GitHub)" <gi...@apache.org>.

uros-db commented on code in PR #45749:
URL: https://github.com/apache/spark/pull/45749#discussion_r1546539089


##########
common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java:
##########
@@ -767,6 +947,139 @@ public UTF8String trimRight(UTF8String trimString) {
     return copyUTF8String(0, trimEnd);
   }
 
+  /**
+   * Trims characters of the given trim string from the end of this string.
+   * This variant of the method additionally applies provided collation to this string
+   * and trim characters before searching.
+   *
+   * @param trimString The trim characters string.
+   * @param collationId Id of the collation to use.
+   * @return this string with no occurrences of the trim characters at the end.
+   */
+  public UTF8String trimRight(UTF8String trimString, int collationId) {

Review Comment:
   yes, there are some double checks when calling these UTF8String functions from codegen for UTF8_BINARY_LCASE
   
   although it's only additional "if" checks, it could have perf implications
   1. we should establish a benchmark to see how it actually affects overall performance
   2. we should consider a unified way to handle this collation-specific code branching:
   2a. we can special-case UTF8_BINARY_LCASE when doing genCode
   2b. we can make a static map for all supported {collationId, functionName} pairs
   
   however, I suggest that we tackle that in a separate effort, as it seems a bit out of scope for this ticket



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47409][SQL] Add support for collation for StringTrim type of functions/expressions [spark]

Posted by "davidm-db (via GitHub)" <gi...@apache.org>.

davidm-db closed pull request #45749: [SPARK-47409][SQL] Add support for collation for StringTrim type of functions/expressions
URL: https://github.com/apache/spark/pull/45749


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47409][SQL] Add support for collation for StringTrim type of functions/expressions [spark]

Posted by "uros-db (via GitHub)" <gi...@apache.org>.

uros-db commented on code in PR #45749:
URL: https://github.com/apache/spark/pull/45749#discussion_r1546505391


##########
common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java:
##########
@@ -686,6 +743,111 @@ public UTF8String trimLeft(UTF8String trimString) {
     return copyUTF8String(trimIdx, numBytes - 1);
   }
 
+  /**
+   * Trims characters of the given trim string from the start of this string.
+   * This variant of the method additionally applies provided collation to this string
+   * and trim characters before searching.
+   *
+   * @param trimString The trim characters string.
+   * @param collationId Id of the collation to use.
+   * @return this string with no occurrences of the trim characters at the start.
+   */
+  public UTF8String trimLeft(UTF8String trimString, int collationId) {
+    if (CollationFactory.fetchCollation(collationId).supportsBinaryEquality) {
+      return trimLeft(trimString);
+    }
+
+    if (CollationFactory.UTF8_BINARY_LCASE_COLLATION_ID == collationId) {
+      return lowercaseTrimLeft(trimString);
+    }
+
+    return collatedTrimLeft(trimString, collationId);
+  }
+
+  private UTF8String lowercaseTrimLeft(UTF8String trimString) {
+    if (trimString == null) {
+      return null;
+    }
+
+    // The searching byte position in the lowercase source string
+    int searchIdx = 0;
+    // The byte position of a first non-matching character in the lowercase source string
+    int trimByteIdx = 0;
+
+    // Convert trimString to lowercase so it can be searched properly
+    trimString = trimString.toLowerCase();

Review Comment:
   not in the context of string expressions, here we use custom `UTF8String` functions:
   ```
   public UTF8String toLowerCase()
   public UTF8String toUpperCase()
   ```
   
   note that in this context we only use this for UTF8_BINARY_LCASE, so no locale to be specified



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47409][SQL] Add support for collation for StringTrim type of functions/expressions [spark]

Posted by "uros-db (via GitHub)" <gi...@apache.org>.

uros-db commented on code in PR #45749:
URL: https://github.com/apache/spark/pull/45749#discussion_r1546505391


##########
common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java:
##########
@@ -686,6 +743,111 @@ public UTF8String trimLeft(UTF8String trimString) {
     return copyUTF8String(trimIdx, numBytes - 1);
   }
 
+  /**
+   * Trims characters of the given trim string from the start of this string.
+   * This variant of the method additionally applies provided collation to this string
+   * and trim characters before searching.
+   *
+   * @param trimString The trim characters string.
+   * @param collationId Id of the collation to use.
+   * @return this string with no occurrences of the trim characters at the start.
+   */
+  public UTF8String trimLeft(UTF8String trimString, int collationId) {
+    if (CollationFactory.fetchCollation(collationId).supportsBinaryEquality) {
+      return trimLeft(trimString);
+    }
+
+    if (CollationFactory.UTF8_BINARY_LCASE_COLLATION_ID == collationId) {
+      return lowercaseTrimLeft(trimString);
+    }
+
+    return collatedTrimLeft(trimString, collationId);
+  }
+
+  private UTF8String lowercaseTrimLeft(UTF8String trimString) {
+    if (trimString == null) {
+      return null;
+    }
+
+    // The searching byte position in the lowercase source string
+    int searchIdx = 0;
+    // The byte position of a first non-matching character in the lowercase source string
+    int trimByteIdx = 0;
+
+    // Convert trimString to lowercase so it can be searched properly
+    trimString = trimString.toLowerCase();

Review Comment:
   no, here we actually use custom `UTF8String` functions:
   ```
   public UTF8String toLowerCase()
   public UTF8String toUpperCase()
   ```
   
   note that in this context we only use this for UTF8_BINARY_LCASE, so no locale to be specified



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47409][SQL] Add support for collation for StringTrim type of functions/expressions [spark]

Posted by "stefankandic (via GitHub)" <gi...@apache.org>.

stefankandic commented on code in PR #45749:
URL: https://github.com/apache/spark/pull/45749#discussion_r1546428567


##########
common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java:
##########
@@ -686,6 +743,111 @@ public UTF8String trimLeft(UTF8String trimString) {
     return copyUTF8String(trimIdx, numBytes - 1);
   }
 
+  /**
+   * Trims characters of the given trim string from the start of this string.
+   * This variant of the method additionally applies provided collation to this string
+   * and trim characters before searching.
+   *
+   * @param trimString The trim characters string.
+   * @param collationId Id of the collation to use.
+   * @return this string with no occurrences of the trim characters at the start.
+   */
+  public UTF8String trimLeft(UTF8String trimString, int collationId) {
+    if (CollationFactory.fetchCollation(collationId).supportsBinaryEquality) {
+      return trimLeft(trimString);
+    }
+
+    if (CollationFactory.UTF8_BINARY_LCASE_COLLATION_ID == collationId) {
+      return lowercaseTrimLeft(trimString);
+    }
+
+    return collatedTrimLeft(trimString, collationId);
+  }
+
+  private UTF8String lowercaseTrimLeft(UTF8String trimString) {
+    if (trimString == null) {
+      return null;
+    }
+
+    // The searching byte position in the lowercase source string
+    int searchIdx = 0;
+    // The byte position of a first non-matching character in the lowercase source string
+    int trimByteIdx = 0;
+
+    // Convert trimString to lowercase so it can be searched properly
+    trimString = trimString.toLowerCase();

Review Comment:
   we should probably explicitly set the root locale for all lowercase/uppercase calls, right? @uros-db 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47409][SQL] Add support for collation for StringTrim type of functions/expressions [spark]

Posted by "uros-db (via GitHub)" <gi...@apache.org>.

uros-db commented on code in PR #45749:
URL: https://github.com/apache/spark/pull/45749#discussion_r1546326805


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala:
##########
@@ -1054,32 +1071,64 @@ trait String2TrimExpression extends Expression with ImplicitCastInputTypes {
     val evals = children.map(_.genCode(ctx))
     val srcString = evals(0)
 
-    if (evals.length == 1) {
-      ev.copy(code = code"""
-         |${srcString.code}
-         |boolean ${ev.isNull} = false;
-         |UTF8String ${ev.value} = null;
-         |if (${srcString.isNull}) {
-         |  ${ev.isNull} = true;
-         |} else {
-         |  ${ev.value} = ${srcString.value}.$trimMethod();
-         |}""".stripMargin)
-    } else {
-      val trimString = evals(1)
-      ev.copy(code = code"""
-         |${srcString.code}
-         |boolean ${ev.isNull} = false;
-         |UTF8String ${ev.value} = null;
-         |if (${srcString.isNull}) {
-         |  ${ev.isNull} = true;
-         |} else {
-         |  ${trimString.code}
-         |  if (${trimString.isNull}) {
-         |    ${ev.isNull} = true;
-         |  } else {
-         |    ${ev.value} = ${srcString.value}.$trimMethod(${trimString.value});
-         |  }
-         |}""".stripMargin)
+    if (CollationFactory.fetchCollation(collationId).supportsBinaryEquality) {
+      if (evals.length == 1) {
+        ev.copy(code = code"""
+          |${srcString.code}
+          |boolean ${ev.isNull} = false;
+          |UTF8String ${ev.value} = null;
+          |if (${srcString.isNull}) {
+          |  ${ev.isNull} = true;
+          |} else {
+          |  ${ev.value} = ${srcString.value}.$trimMethod();
+          |}""".stripMargin)
+      } else {
+        val trimString = evals(1)
+        ev.copy(code = code"""
+          |${srcString.code}
+          |boolean ${ev.isNull} = false;
+          |UTF8String ${ev.value} = null;
+          |if (${srcString.isNull}) {
+          |  ${ev.isNull} = true;
+          |} else {
+          |  ${trimString.code}
+          |  if (${trimString.isNull}) {
+          |    ${ev.isNull} = true;
+          |  } else {
+          |    ${ev.value} = ${srcString.value}.$trimMethod(${trimString.value});
+          |  }
+          |}""".stripMargin)
+      }
+    }
+    else {
+      if (evals.length == 1) {
+        ev.copy(code = code"""
+          |${srcString.code}
+          |boolean ${ev.isNull} = false;
+          |UTF8String ${ev.value} = null;
+          |if (${srcString.isNull}) {
+          |  ${ev.isNull} = true;
+          |} else {
+          |  ${ev.value} = ${srcString.value}.$trimMethod($collationId);
+          |}""".stripMargin)
+      } else {
+        val trimString = evals(1)
+        ev.copy(code = code"""
+          |${srcString.code}
+          |boolean ${ev.isNull} = false;
+          |UTF8String ${ev.value} = null;
+          |if (${srcString.isNull}) {
+          |  ${ev.isNull} = true;
+          |} else {
+          |  ${trimString.code}
+          |  if (${trimString.isNull}) {
+          |    ${ev.isNull} = true;
+          |  } else {
+          |    ${ev.value} =
+          |      ${srcString.value}.$trimMethod(${trimString.value}, $collationId);
+          |  }
+          |}""".stripMargin)
+      }

Review Comment:
   I think this code is very similar to the one above, consider rewriting it a bit more neatly in a way that incorporates $collationId



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47409][SQL] Add support for collation for StringTrim type of functions/expressions [spark]

Posted by "uros-db (via GitHub)" <gi...@apache.org>.

uros-db commented on code in PR #45749:
URL: https://github.com/apache/spark/pull/45749#discussion_r1546328445


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala:
##########
@@ -1170,12 +1219,25 @@ case class StringTrim(srcStr: Expression, trimStr: Option[Expression] = None)
 
   override protected def direction: String = "BOTH"
 
-  override def doEval(srcString: UTF8String): UTF8String = srcString.trim()
+  override val trimMethod: String = "trim"

Review Comment:
   avoid unnecessary changes (I think this was below `override def doEval`)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47409][SQL] Add support for collation for StringTrim type of functions/expressions [spark]

Posted by "stefankandic (via GitHub)" <gi...@apache.org>.

stefankandic commented on code in PR #45749:
URL: https://github.com/apache/spark/pull/45749#discussion_r1546432009


##########
common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java:
##########
@@ -767,6 +947,139 @@ public UTF8String trimRight(UTF8String trimString) {
     return copyUTF8String(0, trimEnd);
   }
 
+  /**
+   * Trims characters of the given trim string from the end of this string.
+   * This variant of the method additionally applies provided collation to this string
+   * and trim characters before searching.
+   *
+   * @param trimString The trim characters string.
+   * @param collationId Id of the collation to use.
+   * @return this string with no occurrences of the trim characters at the end.
+   */
+  public UTF8String trimRight(UTF8String trimString, int collationId) {

Review Comment:
   as far as I see this is the only public method exposing this trim api, so does that mean that we will always have to perform collation check in order to call collated trim, even if in codegen we know which collation we are working with?
   
   this seems like a perf issue when you consider running this for millions of rows



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47409][SQL] Add support for collation for StringTrim type of functions/expressions [spark]

Posted by "uros-db (via GitHub)" <gi...@apache.org>.

uros-db commented on code in PR #45749:
URL: https://github.com/apache/spark/pull/45749#discussion_r1578907442


##########
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationFactory.java:
##########
@@ -175,12 +175,19 @@ public Collation(
    * Auxiliary methods for collation aware string operations.
    */
 
+  /**
+   * Creates an instance of ICU's StringSearch with provided parameters.
+   * @param targetUTF8String UTF8String representation of the string to be searched.
+   * @param patternUTF8String UTF8String representation of the string to search for.
+   * @param collationId ID of the collation to use.
+   * @return Created instance of StringSearch.
+   */
   public static StringSearch getStringSearch(
-      final UTF8String left,
-      final UTF8String right,
+      final UTF8String targetUTF8String,
+      final UTF8String patternUTF8String,
       final int collationId) {
-    String pattern = right.toString();
-    CharacterIterator target = new StringCharacterIterator(left.toString());
+    String pattern = patternUTF8String.toString();
+    CharacterIterator target = new StringCharacterIterator(targetUTF8String.toString());
     Collator collator = CollationFactory.fetchCollation(collationId).collator;
     return new StringSearch(pattern, target, (RuleBasedCollator) collator);
   }

Review Comment:
   https://github.com/apache/spark/blob/b4624bf4be28974e8df175670f89cf96858b7b81/common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationFactory.java#L195



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47409][SQL] Add support for collation for StringTrim type of functions/expressions [spark]

Posted by "mihailom-db (via GitHub)" <gi...@apache.org>.

mihailom-db commented on PR #45749:
URL: https://github.com/apache/spark/pull/45749#issuecomment-2024866802

   Also, please add tests to `CollationStringExpressionSuite`


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47409][SQL] Add support for collation for StringTrim type of functions/expressions [spark]

Posted by "uros-db (via GitHub)" <gi...@apache.org>.

uros-db commented on code in PR #45749:
URL: https://github.com/apache/spark/pull/45749#discussion_r1546516509


##########
common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java:
##########
@@ -628,6 +646,27 @@ public UTF8String trim(UTF8String trimString) {
     }
   }
 
+  /**
+   * Trims characters of the given trim string from both ends of this string.
+   * This variant of the method additionally applies provided collation to this string
+   * and trim characters before searching.
+   *
+   * @param trimString The trim characters string.
+   * @param collationId Id of the collation to use.
+   * @return this string with no occurrences of the characters from trim string.
+   */
+  public UTF8String trim(UTF8String trimString, int collationId) {

Review Comment:
   Let's go with this naming: `collationAwareTrim`
   
   ref: https://github.com/apache/spark/pull/45749#discussion_r1546421174



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47409][SQL] Add support for collation for StringTrim type of functions/expressions [spark]

Posted by "uros-db (via GitHub)" <gi...@apache.org>.

uros-db commented on code in PR #45749:
URL: https://github.com/apache/spark/pull/45749#discussion_r1546329905


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala:
##########
@@ -1278,12 +1340,25 @@ case class StringTrimLeft(srcStr: Expression, trimStr: Option[Expression] = None
 
   override protected def direction: String = "LEADING"
 
-  override def doEval(srcString: UTF8String): UTF8String = srcString.trimLeft()
+  override val trimMethod: String = "trimLeft"

Review Comment:
   ditto (avoid unnecessary changes)



##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala:
##########
@@ -1339,12 +1414,25 @@ case class StringTrimRight(srcStr: Expression, trimStr: Option[Expression] = Non
 
   override protected def direction: String = "TRAILING"
 
-  override def doEval(srcString: UTF8String): UTF8String = srcString.trimRight()
+  override val trimMethod: String = "trimRight"

Review Comment:
   ditto (avoid unnecessary changes)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47409][SQL] Add support for collation for StringTrim type of functions/expressions [spark]

Posted by "uros-db (via GitHub)" <gi...@apache.org>.

uros-db commented on code in PR #45749:
URL: https://github.com/apache/spark/pull/45749#discussion_r1546489340


##########
common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java:
##########
@@ -686,6 +743,111 @@ public UTF8String trimLeft(UTF8String trimString) {
     return copyUTF8String(trimIdx, numBytes - 1);
   }
 
+  /**
+   * Trims characters of the given trim string from the start of this string.
+   * This variant of the method additionally applies provided collation to this string
+   * and trim characters before searching.
+   *
+   * @param trimString The trim characters string.
+   * @param collationId Id of the collation to use.
+   * @return this string with no occurrences of the trim characters at the start.
+   */
+  public UTF8String trimLeft(UTF8String trimString, int collationId) {
+    if (CollationFactory.fetchCollation(collationId).supportsBinaryEquality) {
+      return trimLeft(trimString);
+    }
+
+    if (CollationFactory.UTF8_BINARY_LCASE_COLLATION_ID == collationId) {
+      return lowercaseTrimLeft(trimString);
+    }
+
+    return collatedTrimLeft(trimString, collationId);
+  }
+
+  private UTF8String lowercaseTrimLeft(UTF8String trimString) {
+    if (trimString == null) {
+      return null;
+    }
+
+    // The searching byte position in the lowercase source string
+    int searchIdx = 0;
+    // The byte position of a first non-matching character in the lowercase source string
+    int trimByteIdx = 0;
+
+    // Convert trimString to lowercase so it can be searched properly
+    trimString = trimString.toLowerCase();

Review Comment:
   yes, in the context of string expressions, we use `.toLowerCase()` / `.toUpperCase()` only for _UTF8_BINARY_LCASE_ so we should consider specifying the base locale (**Locale.ROOT**) in those calls



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47409][SQL] Add support for collation for StringTrim type of functions/expressions [spark]

Posted by "uros-db (via GitHub)" <gi...@apache.org>.

uros-db commented on code in PR #45749:
URL: https://github.com/apache/spark/pull/45749#discussion_r1546509675


##########
common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java:
##########
@@ -686,6 +743,111 @@ public UTF8String trimLeft(UTF8String trimString) {
     return copyUTF8String(trimIdx, numBytes - 1);
   }
 
+  /**
+   * Trims characters of the given trim string from the start of this string.
+   * This variant of the method additionally applies provided collation to this string
+   * and trim characters before searching.
+   *
+   * @param trimString The trim characters string.
+   * @param collationId Id of the collation to use.
+   * @return this string with no occurrences of the trim characters at the start.
+   */
+  public UTF8String trimLeft(UTF8String trimString, int collationId) {
+    if (CollationFactory.fetchCollation(collationId).supportsBinaryEquality) {
+      return trimLeft(trimString);
+    }
+
+    if (CollationFactory.UTF8_BINARY_LCASE_COLLATION_ID == collationId) {
+      return lowercaseTrimLeft(trimString);
+    }
+
+    return collatedTrimLeft(trimString, collationId);
+  }
+
+  private UTF8String lowercaseTrimLeft(UTF8String trimString) {
+    if (trimString == null) {
+      return null;
+    }
+
+    // The searching byte position in the lowercase source string
+    int searchIdx = 0;
+    // The byte position of a first non-matching character in the lowercase source string
+    int trimByteIdx = 0;
+
+    // Convert trimString to lowercase so it can be searched properly
+    trimString = trimString.toLowerCase();
+
+    while (searchIdx < numBytes) {
+      UTF8String searchChar = copyUTF8String(
+        searchIdx,
+        searchIdx + numBytesForFirstByte(getByte(searchIdx)) - 1);
+      int searchCharBytes = searchChar.numBytes;
+
+      // Try to find the matching for the lowercase searchChar in the trimString
+      if (trimString.find(searchChar.toLowerCase(), 0) >= 0) {
+        trimByteIdx += searchCharBytes;
+        searchIdx += searchCharBytes;
+      }
+      else {
+        // No matching, exit the search
+        break;
+      }
+    }
+
+    if (searchIdx == 0) {
+      // Nothing trimmed - return original string (not converted to lowercase)
+      return this;
+    }
+    if (trimByteIdx  >= numBytes) {
+      // Everything trimmed
+      return EMPTY_UTF8;
+    }
+    return copyUTF8String(trimByteIdx, numBytes - 1);
+  }
+
+  private UTF8String collatedTrimLeft(UTF8String trimString, int collationId) {

Review Comment:
   `collationAwareFunctionName` sounds good to me, we can discuss consolidating the naming for all functions



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47409][SQL] Add support for collation for StringTrim type of functions/expressions [spark]

Posted by "mihailom-db (via GitHub)" <gi...@apache.org>.

mihailom-db commented on code in PR #45749:
URL: https://github.com/apache/spark/pull/45749#discussion_r1542648461


##########
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationFactory.java:
##########
@@ -176,11 +176,11 @@ public Collation(
    */
 
   public static StringSearch getStringSearch(
-      final UTF8String left,
-      final UTF8String right,
+      final UTF8String targetString,
+      final UTF8String patternString,

Review Comment:
   Agreed that this should be changed, but I find it a bit weird. We call it `targetString` and then call `toString` method on it later. I would maybe add something like `targetUTF8String`. @dbatomic what do you think about this?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47409][SQL] Add support for collation for StringTrim type of functions/expressions [spark]

Posted by "stefankandic (via GitHub)" <gi...@apache.org>.

stefankandic commented on code in PR #45749:
URL: https://github.com/apache/spark/pull/45749#discussion_r1546421174


##########
common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java:
##########
@@ -686,6 +743,111 @@ public UTF8String trimLeft(UTF8String trimString) {
     return copyUTF8String(trimIdx, numBytes - 1);
   }
 
+  /**
+   * Trims characters of the given trim string from the start of this string.
+   * This variant of the method additionally applies provided collation to this string
+   * and trim characters before searching.
+   *
+   * @param trimString The trim characters string.
+   * @param collationId Id of the collation to use.
+   * @return this string with no occurrences of the trim characters at the start.
+   */
+  public UTF8String trimLeft(UTF8String trimString, int collationId) {
+    if (CollationFactory.fetchCollation(collationId).supportsBinaryEquality) {
+      return trimLeft(trimString);
+    }
+
+    if (CollationFactory.UTF8_BINARY_LCASE_COLLATION_ID == collationId) {
+      return lowercaseTrimLeft(trimString);
+    }
+
+    return collatedTrimLeft(trimString, collationId);
+  }
+
+  private UTF8String lowercaseTrimLeft(UTF8String trimString) {
+    if (trimString == null) {
+      return null;
+    }
+
+    // The searching byte position in the lowercase source string
+    int searchIdx = 0;
+    // The byte position of a first non-matching character in the lowercase source string
+    int trimByteIdx = 0;
+
+    // Convert trimString to lowercase so it can be searched properly
+    trimString = trimString.toLowerCase();
+
+    while (searchIdx < numBytes) {
+      UTF8String searchChar = copyUTF8String(
+        searchIdx,
+        searchIdx + numBytesForFirstByte(getByte(searchIdx)) - 1);
+      int searchCharBytes = searchChar.numBytes;
+
+      // Try to find the matching for the lowercase searchChar in the trimString
+      if (trimString.find(searchChar.toLowerCase(), 0) >= 0) {
+        trimByteIdx += searchCharBytes;
+        searchIdx += searchCharBytes;
+      }
+      else {
+        // No matching, exit the search
+        break;
+      }
+    }
+
+    if (searchIdx == 0) {
+      // Nothing trimmed - return original string (not converted to lowercase)
+      return this;
+    }
+    if (trimByteIdx  >= numBytes) {
+      // Everything trimmed
+      return EMPTY_UTF8;
+    }
+    return copyUTF8String(trimByteIdx, numBytes - 1);
+  }
+
+  private UTF8String collatedTrimLeft(UTF8String trimString, int collationId) {

Review Comment:
   @uros-db I see a number of PRs from different people adding similar functionality so just wondering if you can start a discussion to standardize the names of these kinds of methods because I've seen multiple variants
   
   my vote would be for `collationAwareOp`



##########
common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java:
##########
@@ -686,6 +743,111 @@ public UTF8String trimLeft(UTF8String trimString) {
     return copyUTF8String(trimIdx, numBytes - 1);
   }
 
+  /**
+   * Trims characters of the given trim string from the start of this string.
+   * This variant of the method additionally applies provided collation to this string
+   * and trim characters before searching.
+   *
+   * @param trimString The trim characters string.
+   * @param collationId Id of the collation to use.
+   * @return this string with no occurrences of the trim characters at the start.
+   */
+  public UTF8String trimLeft(UTF8String trimString, int collationId) {
+    if (CollationFactory.fetchCollation(collationId).supportsBinaryEquality) {
+      return trimLeft(trimString);
+    }
+
+    if (CollationFactory.UTF8_BINARY_LCASE_COLLATION_ID == collationId) {
+      return lowercaseTrimLeft(trimString);
+    }
+
+    return collatedTrimLeft(trimString, collationId);
+  }
+
+  private UTF8String lowercaseTrimLeft(UTF8String trimString) {
+    if (trimString == null) {
+      return null;
+    }
+
+    // The searching byte position in the lowercase source string
+    int searchIdx = 0;
+    // The byte position of a first non-matching character in the lowercase source string
+    int trimByteIdx = 0;
+
+    // Convert trimString to lowercase so it can be searched properly
+    trimString = trimString.toLowerCase();
+
+    while (searchIdx < numBytes) {
+      UTF8String searchChar = copyUTF8String(
+        searchIdx,
+        searchIdx + numBytesForFirstByte(getByte(searchIdx)) - 1);
+      int searchCharBytes = searchChar.numBytes;
+
+      // Try to find the matching for the lowercase searchChar in the trimString
+      if (trimString.find(searchChar.toLowerCase(), 0) >= 0) {
+        trimByteIdx += searchCharBytes;
+        searchIdx += searchCharBytes;
+      }
+      else {
+        // No matching, exit the search
+        break;
+      }
+    }
+
+    if (searchIdx == 0) {
+      // Nothing trimmed - return original string (not converted to lowercase)
+      return this;
+    }
+    if (trimByteIdx  >= numBytes) {
+      // Everything trimmed
+      return EMPTY_UTF8;
+    }
+    return copyUTF8String(trimByteIdx, numBytes - 1);
+  }
+
+  private UTF8String collatedTrimLeft(UTF8String trimString, int collationId) {

Review Comment:
   @uros-db I see a number of PRs from different people adding similar functionality so just wondering if you can start a discussion to standardize the names of these kinds of methods because I've seen multiple variants
   
   my vote would be for `collationAware...`



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47409][SQL] Add support for collation for StringTrim type of functions/expressions [spark]

Posted by "uros-db (via GitHub)" <gi...@apache.org>.

uros-db commented on PR #45749:
URL: https://github.com/apache/spark/pull/45749#issuecomment-2049814327

   heads up: we’ve done some major code restructuring in https://github.com/apache/spark/pull/45978, so please sync these changes before moving on
   
   @davidm-db you’ll likely need to rewrite the code in this PR, so please follow the guidelines outlined in https://issues.apache.org/jira/browse/SPARK-47410


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47409][SQL] Add support for collation for StringTrim type of functions/expressions [spark]

Posted by "mihailom-db (via GitHub)" <gi...@apache.org>.

mihailom-db commented on PR #45749:
URL: https://github.com/apache/spark/pull/45749#issuecomment-2055969457

   Please add these expressions to CollationTypeCasts rules.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47409][SQL] Add support for collation for StringTrim type of functions/expressions [spark]

Posted by "uros-db (via GitHub)" <gi...@apache.org>.

uros-db commented on code in PR #45749:
URL: https://github.com/apache/spark/pull/45749#discussion_r1578907085


##########
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationFactory.java:
##########
@@ -175,12 +175,19 @@ public Collation(
    * Auxiliary methods for collation aware string operations.
    */
 
+  /**
+   * Creates an instance of ICU's StringSearch with provided parameters.
+   * @param targetUTF8String UTF8String representation of the string to be searched.
+   * @param patternUTF8String UTF8String representation of the string to search for.
+   * @param collationId ID of the collation to use.
+   * @return Created instance of StringSearch.
+   */
   public static StringSearch getStringSearch(
-      final UTF8String left,
-      final UTF8String right,
+      final UTF8String targetUTF8String,
+      final UTF8String patternUTF8String,
       final int collationId) {
-    String pattern = right.toString();
-    CharacterIterator target = new StringCharacterIterator(left.toString());
+    String pattern = patternUTF8String.toString();
+    CharacterIterator target = new StringCharacterIterator(targetUTF8String.toString());
     Collator collator = CollationFactory.fetchCollation(collationId).collator;
     return new StringSearch(pattern, target, (RuleBasedCollator) collator);
   }

Review Comment:
   did you pull the latest changes on master? we now have a different version of stringSearch



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47409][SQL] Add support for collation for StringTrim type of functions/expressions [spark]

Posted by "mihailom-db (via GitHub)" <gi...@apache.org>.

mihailom-db commented on code in PR #45749:
URL: https://github.com/apache/spark/pull/45749#discussion_r1542665840


##########
common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java:
##########
@@ -733,27 +873,30 @@ public UTF8String trimRight(UTF8String trimString) {
     int[] stringCharLen = new int[numBytes];
     // array of the first byte position for each character in the source string
     int[] stringCharPos = new int[numBytes];
+
     // build the position and length array
     while (charIdx < numBytes) {
       stringCharPos[numChars] = charIdx;
       stringCharLen[numChars] = numBytesForFirstByte(getByte(charIdx));
       charIdx += stringCharLen[numChars];
-      numChars ++;
+      numChars++;
     }
 
     // index trimEnd points to the first no matching byte position from the right side of
     // the source string.
     int trimEnd = numBytes - 1;
+
     while (numChars > 0) {
       UTF8String searchChar = copyUTF8String(
           stringCharPos[numChars - 1],
           stringCharPos[numChars - 1] + stringCharLen[numChars - 1] - 1);
+
       if (trimString.find(searchChar, 0) >= 0) {
         trimEnd -= stringCharLen[numChars - 1];
       } else {
         break;
       }
-      numChars --;
+      numChars--;

Review Comment:
   Avoid reformatting the whole file, as it creates a lot of backporting issues. If you want to reformat your code select the part and pres cmd + option + L.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47409][SQL] Add support for collation for StringTrim type of functions/expressions [spark]

Posted by "stefankandic (via GitHub)" <gi...@apache.org>.

stefankandic commented on code in PR #45749:
URL: https://github.com/apache/spark/pull/45749#discussion_r1546417209


##########
common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java:
##########
@@ -767,6 +947,139 @@ public UTF8String trimRight(UTF8String trimString) {
     return copyUTF8String(0, trimEnd);
   }
 
+  /**
+   * Trims characters of the given trim string from the end of this string.
+   * This variant of the method additionally applies provided collation to this string
+   * and trim characters before searching.
+   *
+   * @param trimString The trim characters string.
+   * @param collationId Id of the collation to use.
+   * @return this string with no occurrences of the trim characters at the end.
+   */
+  public UTF8String trimRight(UTF8String trimString, int collationId) {
+    if (CollationFactory.fetchCollation(collationId).supportsBinaryEquality) {
+      return trimRight(trimString);
+    }
+
+    if (CollationFactory.UTF8_BINARY_LCASE_COLLATION_ID == collationId) {
+      return lowercaseTrimRight(trimString);
+    }
+
+    return collatedTrimRight(trimString, collationId);
+  }
+
+  private UTF8String lowercaseTrimRight(UTF8String trimString) {
+    if (trimString == null) {
+      return null;
+    }
+
+    // Convert trimString to lowercase so it can be searched properly
+    trimString = trimString.toLowerCase();
+
+    // Number of bytes iterated from the source string
+    int byteIdx = 0;
+    // Number of characters iterated from the source string
+    int numChars = 0;
+    // Array of character length for the source string
+    int[] stringCharLen = new int[numBytes];
+    // Array of the first byte position for each character in the source string
+    int[] stringCharPos = new int[numBytes];
+
+    // Build the position and length array
+    while (byteIdx < numBytes) {
+      stringCharPos[numChars] = byteIdx;
+      stringCharLen[numChars] = numBytesForFirstByte(getByte(byteIdx));
+      byteIdx += stringCharLen[numChars];
+      numChars++;
+    }
+
+    // Index trimEnd points to the first no matching byte position from the right side of
+    //  the source string.
+    int trimByteIdx = numBytes - 1;
+
+    while (numChars > 0) {
+      UTF8String searchChar = copyUTF8String(
+        stringCharPos[numChars - 1],
+        stringCharPos[numChars - 1] + stringCharLen[numChars - 1] - 1);
+
+      // Try to find the matching for the lowercase searchChar in the trimString
+      if (trimString.find(searchChar.toLowerCase(), 0) >= 0) {
+        trimByteIdx -= stringCharLen[numChars - 1];
+        numChars--;
+      }
+      else {

Review Comment:
   nit: in jvm languages else is almost always on the same line as the closing brace of it's if



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47409][SQL] Add support for collation for StringTrim type of functions/expressions [spark]

Posted by "uros-db (via GitHub)" <gi...@apache.org>.

uros-db commented on code in PR #45749:
URL: https://github.com/apache/spark/pull/45749#discussion_r1546489340


##########
common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java:
##########
@@ -686,6 +743,111 @@ public UTF8String trimLeft(UTF8String trimString) {
     return copyUTF8String(trimIdx, numBytes - 1);
   }
 
+  /**
+   * Trims characters of the given trim string from the start of this string.
+   * This variant of the method additionally applies provided collation to this string
+   * and trim characters before searching.
+   *
+   * @param trimString The trim characters string.
+   * @param collationId Id of the collation to use.
+   * @return this string with no occurrences of the trim characters at the start.
+   */
+  public UTF8String trimLeft(UTF8String trimString, int collationId) {
+    if (CollationFactory.fetchCollation(collationId).supportsBinaryEquality) {
+      return trimLeft(trimString);
+    }
+
+    if (CollationFactory.UTF8_BINARY_LCASE_COLLATION_ID == collationId) {
+      return lowercaseTrimLeft(trimString);
+    }
+
+    return collatedTrimLeft(trimString, collationId);
+  }
+
+  private UTF8String lowercaseTrimLeft(UTF8String trimString) {
+    if (trimString == null) {
+      return null;
+    }
+
+    // The searching byte position in the lowercase source string
+    int searchIdx = 0;
+    // The byte position of a first non-matching character in the lowercase source string
+    int trimByteIdx = 0;
+
+    // Convert trimString to lowercase so it can be searched properly
+    trimString = trimString.toLowerCase();

Review Comment:
   yes, in the context of string expressions, we use `.toLowerCase()` / `.toUpperCase()` only for _UTF8_BINARY_LCASE_ so we should consider specifying the base locale (**Locale.ROOT**) in those calls



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47409][SQL] Add support for collation for StringTrim type of functions/expressions [spark]

Posted by "uros-db (via GitHub)" <gi...@apache.org>.

uros-db commented on code in PR #45749:
URL: https://github.com/apache/spark/pull/45749#discussion_r1546312983


##########
common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java:
##########
@@ -686,6 +743,111 @@ public UTF8String trimLeft(UTF8String trimString) {
     return copyUTF8String(trimIdx, numBytes - 1);
   }
 
+  /**
+   * Trims characters of the given trim string from the start of this string.
+   * This variant of the method additionally applies provided collation to this string
+   * and trim characters before searching.
+   *
+   * @param trimString The trim characters string.
+   * @param collationId Id of the collation to use.
+   * @return this string with no occurrences of the trim characters at the start.
+   */
+  public UTF8String trimLeft(UTF8String trimString, int collationId) {
+    if (CollationFactory.fetchCollation(collationId).supportsBinaryEquality) {
+      return trimLeft(trimString);
+    }
+
+    if (CollationFactory.UTF8_BINARY_LCASE_COLLATION_ID == collationId) {
+      return lowercaseTrimLeft(trimString);
+    }
+
+    return collatedTrimLeft(trimString, collationId);
+  }
+
+  private UTF8String lowercaseTrimLeft(UTF8String trimString) {

Review Comment:
   Instead of implementing `lowercaseTrimLeft` and `collatedTrimLeft` separately (these functions look very similar to me), I think we could make use of new `StringSearch(pattern, target)` (with `.toLowerCase()` for both params, and no collationId for **UTF8_BINARY_LCASE**)
   
   For more context, please take a look at: https://github.com/apache/spark/pull/45704/files#r1538624688



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47409][SQL] Add support for collation for StringTrim type of functions/expressions [spark]

Posted by "uros-db (via GitHub)" <gi...@apache.org>.

uros-db commented on code in PR #45749:
URL: https://github.com/apache/spark/pull/45749#discussion_r1546323015


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala:
##########
@@ -1028,15 +1028,32 @@ trait String2TrimExpression extends Expression with ImplicitCastInputTypes {
   protected def direction: String
 
   override def children: Seq[Expression] = srcStr +: trimStr.toSeq
-  override def dataType: DataType = StringType
-  override def inputTypes: Seq[AbstractDataType] = Seq.fill(children.size)(StringType)
+  override def dataType: DataType = srcStr.dataType
+  override def inputTypes: Seq[AbstractDataType] =
+    Seq(StringTypeAnyCollation, StringTypeAnyCollation)
+
+  final lazy val collationId: Int = srcStr.dataType.asInstanceOf[StringType].collationId
 
   override def nullable: Boolean = children.exists(_.nullable)
   override def foldable: Boolean = children.forall(_.foldable)
 
   protected def doEval(srcString: UTF8String): UTF8String
   protected def doEval(srcString: UTF8String, trimString: UTF8String): UTF8String
 
+  override def checkInputDataTypes(): TypeCheckResult = {
+    val defaultCheckResult = super.checkInputDataTypes()
+    if (defaultCheckResult.isFailure) {
+      return defaultCheckResult
+    }
+
+    trimStr match {
+      case None => TypeCheckResult.TypeCheckSuccess
+      case Some(trimChars) =>
+        val collationId = srcStr.dataType.asInstanceOf[StringType].collationId

Review Comment:
   I think you already have this above as `final lazy val`



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47409][SQL] Add support for collation for StringTrim type of functions/expressions [spark]

Posted by "davidm-db (via GitHub)" <gi...@apache.org>.

davidm-db commented on PR #45749:
URL: https://github.com/apache/spark/pull/45749#issuecomment-2024875386

   @mihailom-db I'm adding tests to both `CollationStringExpressionSuite` and `CollationSuite` - I just pushed the changes so they can be reviewed while I'm adding the tests :)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org