You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by "stevomitric (via GitHub)" <gi...@apache.org> on 2024/03/07 15:36:24 UTC

[PR] [SPARK-47295] Added ICU StringSearch for 'startsWith' and 'endsWith' functions [spark]

stevomitric opened a new pull request, #45421:
URL: https://github.com/apache/spark/pull/45421

   ### What changes were proposed in this pull request?
   Modified `startsWith` and `endsWith` functions by including `StringSearch` for `UNICODE` and `UNICODE_CI` collations.
   
   
   ### Why are the changes needed?
   Optimized functions for faster performance.
   
   
   ### Does this PR introduce _any_ user-facing change?
   No
   
   
   ### How was this patch tested?
   Used already built `CollationSuite` tests.
   
   ### Was this patch authored or co-authored using generative AI tooling?
   No


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47295][SQL][COLLATION] Added ICU StringSearch for 'startsWith' and 'endsWith' functions [spark]

Posted by "uros-db (via GitHub)" <gi...@apache.org>.
uros-db commented on code in PR #45421:
URL: https://github.com/apache/spark/pull/45421#discussion_r1519387286


##########
common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java:
##########
@@ -378,13 +378,6 @@ public boolean matchAt(final UTF8String s, int pos) {
     return ByteArrayMethods.arrayEquals(base, offset + pos, s.base, s.offset, s.numBytes);
   }
 
-  private boolean matchAt(final UTF8String s, int pos, int collationId) {
-    if (s.numBytes + pos > numBytes || pos < 0) {
-      return false;
-    }
-    return this.substring(pos, pos + s.numBytes).semanticCompare(s, collationId) == 0;

Review Comment:
   We can, that's exactly what Stevo is doing - removing this code and replacing it with `CollationFactory.getStringSearch`. For more context: the old implementation was there before we introduced `StringSearch`.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47295] Added ICU StringSearch for 'startsWith' and 'endsWith' functions [spark]

Posted by "uros-db (via GitHub)" <gi...@apache.org>.
uros-db commented on code in PR #45421:
URL: https://github.com/apache/spark/pull/45421#discussion_r1516379232


##########
common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java:
##########
@@ -31,6 +32,8 @@
 import com.esotericsoftware.kryo.io.Input;
 import com.esotericsoftware.kryo.io.Output;
 
+import com.ibm.icu.text.RuleBasedCollator;

Review Comment:
   we should not expose collator outside of `CollationFactory`



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47295][SQL][COLLATION] Added ICU StringSearch for 'startsWith' and 'endsWith' functions [spark]

Posted by "dbatomic (via GitHub)" <gi...@apache.org>.
dbatomic commented on code in PR #45421:
URL: https://github.com/apache/spark/pull/45421#discussion_r1517975311


##########
common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java:
##########
@@ -410,7 +414,19 @@ public boolean endsWith(final UTF8String suffix, int collationId) {
     if (collationId == CollationFactory.LOWERCASE_COLLATION_ID) {
       return this.toLowerCase().endsWith(suffix.toLowerCase());
     }
-    return matchAt(suffix, numBytes - suffix.numBytes, collationId);
+    return collatedEndsWith(suffix, collationId);
+  }
+
+  private boolean collatedEndsWith(final UTF8String suffix, int collationId) {
+    if (suffix.numBytes == 0 || this.numBytes == 0) {
+      return suffix.numBytes==0;

Review Comment:
   nit: space around ==



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47295] Added ICU StringSearch for 'startsWith' and 'endsWith' functions [spark]

Posted by "stevomitric (via GitHub)" <gi...@apache.org>.
stevomitric commented on code in PR #45421:
URL: https://github.com/apache/spark/pull/45421#discussion_r1517360283


##########
common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java:
##########
@@ -396,7 +396,9 @@ public boolean startsWith(final UTF8String prefix, int collationId) {
     if (collationId == CollationFactory.LOWERCASE_COLLATION_ID) {
       return this.toLowerCase().startsWith(prefix.toLowerCase());
     }
-    return matchAt(prefix, 0, collationId);

Review Comment:
   Deleted



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47295][SQL] Added ICU StringSearch for the `startsWith` and `endsWith` functions [spark]

Posted by "MaxGekk (via GitHub)" <gi...@apache.org>.
MaxGekk commented on PR #45421:
URL: https://github.com/apache/spark/pull/45421#issuecomment-1991415905

   +1, LGTM. Merging to master.
   Thank you, @stevomitric and @dbatomic @uros-db @cloud-fan for review.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47295][SQL][COLLATION] Added ICU StringSearch for 'startsWith' and 'endsWith' functions [spark]

Posted by "MaxGekk (via GitHub)" <gi...@apache.org>.
MaxGekk commented on code in PR #45421:
URL: https://github.com/apache/spark/pull/45421#discussion_r1519361693


##########
common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java:
##########
@@ -378,13 +378,6 @@ public boolean matchAt(final UTF8String s, int pos) {
     return ByteArrayMethods.arrayEquals(base, offset + pos, s.base, s.offset, s.numBytes);
   }
 
-  private boolean matchAt(final UTF8String s, int pos, int collationId) {
-    if (s.numBytes + pos > numBytes || pos < 0) {
-      return false;
-    }
-    return this.substring(pos, pos + s.numBytes).semanticCompare(s, collationId) == 0;

Review Comment:
   I wonder can't we just use `CollationFactory.getStringSearch` here?



##########
common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java:
##########
@@ -396,7 +389,18 @@ public boolean startsWith(final UTF8String prefix, int collationId) {
     if (collationId == CollationFactory.LOWERCASE_COLLATION_ID) {
       return this.toLowerCase().startsWith(prefix.toLowerCase());
     }
-    return matchAt(prefix, 0, collationId);
+    return collatedStartsWith(prefix, collationId);
+  }
+
+  private boolean collatedStartsWith(final UTF8String prefix, int collationId) {

Review Comment:
   Could you point out which unit tests in `UTF8StringSuite` check the functions.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47295][SQL][COLLATION] Added ICU StringSearch for 'startsWith' and 'endsWith' functions [spark]

Posted by "dbatomic (via GitHub)" <gi...@apache.org>.
dbatomic commented on code in PR #45421:
URL: https://github.com/apache/spark/pull/45421#discussion_r1519451854


##########
common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java:
##########
@@ -396,7 +389,18 @@ public boolean startsWith(final UTF8String prefix, int collationId) {
     if (collationId == CollationFactory.LOWERCASE_COLLATION_ID) {
       return this.toLowerCase().startsWith(prefix.toLowerCase());
     }
-    return matchAt(prefix, 0, collationId);
+    return collatedStartsWith(prefix, collationId);
+  }
+
+  private boolean collatedStartsWith(final UTF8String prefix, int collationId) {

Review Comment:
   I think that @MaxGekk has a point.
   `CollationSuite` becomes a bit over cluttered and it should be used for for E2E testing. Can we either add unit test to `UTF8StringSuite` or create `UTF8StringSuiteWithCollation` suite?
   
   Option is also to use `CollationExpressionSuite`.
   
   I would propose following:
   1) E2E collation tests go to `CollationSuite`.
   2) String level tests go to either `UTF8StringSuite` or `UTF8StringWithCollationSuite`.
   3) Expression level tests go to `CollationExpressionSuite`.
   4) Collation management tests to to `CollationFactorySuite`.
   
   @MaxGekk and @cloud-fan - what are your thought on this?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47295] Added ICU StringSearch for 'startsWith' and 'endsWith' functions [spark]

Posted by "uros-db (via GitHub)" <gi...@apache.org>.
uros-db commented on code in PR #45421:
URL: https://github.com/apache/spark/pull/45421#discussion_r1517229022


##########
common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java:
##########
@@ -410,7 +412,9 @@ public boolean endsWith(final UTF8String suffix, int collationId) {
     if (collationId == CollationFactory.LOWERCASE_COLLATION_ID) {
       return this.toLowerCase().endsWith(suffix.toLowerCase());
     }
-    return matchAt(suffix, numBytes - suffix.numBytes, collationId);
+    if (suffix.numBytes == 0 || this.numBytes == 0) return suffix.numBytes==0;
+
+    return CollationFactory.getStringSearch(this, suffix, collationId).last()==this.numChars()-suffix.numChars();

Review Comment:
   for clarity, I think we can separate this out into `private boolean collatedEndsWith`



##########
common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java:
##########
@@ -410,7 +412,9 @@ public boolean endsWith(final UTF8String suffix, int collationId) {
     if (collationId == CollationFactory.LOWERCASE_COLLATION_ID) {
       return this.toLowerCase().endsWith(suffix.toLowerCase());
     }
-    return matchAt(suffix, numBytes - suffix.numBytes, collationId);
+    if (suffix.numBytes == 0 || this.numBytes == 0) return suffix.numBytes==0;
+
+    return CollationFactory.getStringSearch(this, suffix, collationId).last()==this.numChars()-suffix.numChars();

Review Comment:
   for clarity, I think we can separate this out into `private boolean collatedEndsWith(...)`



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47295][SQL][COLLATION] Added ICU StringSearch for 'startsWith' and 'endsWith' functions [spark]

Posted by "uros-db (via GitHub)" <gi...@apache.org>.
uros-db commented on code in PR #45421:
URL: https://github.com/apache/spark/pull/45421#discussion_r1519379776


##########
common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java:
##########
@@ -396,7 +389,18 @@ public boolean startsWith(final UTF8String prefix, int collationId) {
     if (collationId == CollationFactory.LOWERCASE_COLLATION_ID) {
       return this.toLowerCase().startsWith(prefix.toLowerCase());
     }
-    return matchAt(prefix, 0, collationId);
+    return collatedStartsWith(prefix, collationId);
+  }
+
+  private boolean collatedStartsWith(final UTF8String prefix, int collationId) {

Review Comment:
   I think we test these functions only in CollationSuite now, as they are only used in context of collations anyways. To remove clutter, we opted to remove unit tests from `UTF8StringSuite` and `StringExpressionSuite` because those in `CollationSuite` seem to do just fine



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47295][SQL][COLLATION] Added ICU StringSearch for 'startsWith' and 'endsWith' functions [spark]

Posted by "dbatomic (via GitHub)" <gi...@apache.org>.
dbatomic commented on code in PR #45421:
URL: https://github.com/apache/spark/pull/45421#discussion_r1517978032


##########
common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java:
##########
@@ -410,7 +414,19 @@ public boolean endsWith(final UTF8String suffix, int collationId) {
     if (collationId == CollationFactory.LOWERCASE_COLLATION_ID) {
       return this.toLowerCase().endsWith(suffix.toLowerCase());
     }
-    return matchAt(suffix, numBytes - suffix.numBytes, collationId);
+    return collatedEndsWith(suffix, collationId);
+  }
+
+  private boolean collatedEndsWith(final UTF8String suffix, int collationId) {
+    if (suffix.numBytes == 0 || this.numBytes == 0) {
+      return suffix.numBytes==0;
+    }
+    if (suffix.numChars() > this.numChars()) {
+      return false;
+    }
+    return CollationFactory.getStringSearch(
+            this.substring(this.numChars()-suffix.numChars(), this.numChars()),

Review Comment:
   identation is a bit weird here. 4 spaces should be more appropriate? Please align with the rest of file.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47295][SQL][COLLATION] Added ICU StringSearch for 'startsWith' and 'endsWith' functions [spark]

Posted by "uros-db (via GitHub)" <gi...@apache.org>.
uros-db commented on code in PR #45421:
URL: https://github.com/apache/spark/pull/45421#discussion_r1519302618


##########
common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java:
##########
@@ -410,7 +414,20 @@ public boolean endsWith(final UTF8String suffix, int collationId) {
     if (collationId == CollationFactory.LOWERCASE_COLLATION_ID) {
       return this.toLowerCase().endsWith(suffix.toLowerCase());
     }
-    return matchAt(suffix, numBytes - suffix.numBytes, collationId);
+    return collatedEndsWith(suffix, collationId);
+  }
+
+  private boolean collatedEndsWith(final UTF8String suffix, int collationId) {
+    if (suffix.numBytes == 0 || this.numBytes == 0) {
+      return suffix.numBytes == 0;
+    }
+    if (suffix.numChars() > this.numChars()) {
+      return false;
+    }
+    return CollationFactory.getStringSearch(
+      this.substring(this.numChars()-suffix.numChars(), this.numChars()),

Review Comment:
   `this.numChars() - suffix.numChars()`



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47295][SQL][COLLATION] Added ICU StringSearch for 'startsWith' and 'endsWith' functions [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on code in PR #45421:
URL: https://github.com/apache/spark/pull/45421#discussion_r1519546855


##########
common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java:
##########
@@ -396,7 +389,18 @@ public boolean startsWith(final UTF8String prefix, int collationId) {
     if (collationId == CollationFactory.LOWERCASE_COLLATION_ID) {
       return this.toLowerCase().startsWith(prefix.toLowerCase());
     }
-    return matchAt(prefix, 0, collationId);
+    return collatedStartsWith(prefix, collationId);
+  }
+
+  private boolean collatedStartsWith(final UTF8String prefix, int collationId) {

Review Comment:
   SGTM, and +1 to `UTF8StringWithCollationSuite`



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47295] Added ICU StringSearch for 'startsWith' and 'endsWith' functions [spark]

Posted by "uros-db (via GitHub)" <gi...@apache.org>.
uros-db commented on code in PR #45421:
URL: https://github.com/apache/spark/pull/45421#discussion_r1516380909


##########
common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java:
##########
@@ -384,27 +387,47 @@ public boolean startsWith(final UTF8String prefix) {
   }
 
   public boolean startsWith(final UTF8String prefix, int collationId) {
-    if (CollationFactory.fetchCollation(collationId).isBinaryCollation) {
+    if (collationId == CollationFactory.DEFAULT_COLLATION_ID) {

Review Comment:
   we shouldn't make changes to the original logic, this part was fine
   in this PR, we should only focus on the "third" type of collations (non-binary, non-lowercase)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47295][SQL][COLLATION] Added ICU StringSearch for 'startsWith' and 'endsWith' functions [spark]

Posted by "stevomitric (via GitHub)" <gi...@apache.org>.
stevomitric commented on code in PR #45421:
URL: https://github.com/apache/spark/pull/45421#discussion_r1519971782


##########
common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java:
##########
@@ -378,13 +378,6 @@ public boolean matchAt(final UTF8String s, int pos) {
     return ByteArrayMethods.arrayEquals(base, offset + pos, s.base, s.offset, s.numBytes);
   }
 
-  private boolean matchAt(final UTF8String s, int pos, int collationId) {
-    if (s.numBytes + pos > numBytes || pos < 0) {
-      return false;
-    }
-    return this.substring(pos, pos + s.numBytes).semanticCompare(s, collationId) == 0;

Review Comment:
   Migrated `collatedStartsWith` and `collatedEndsWith` functions to here



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47295][SQL][COLLATION] Added ICU StringSearch for 'startsWith' and 'endsWith' functions [spark]

Posted by "stevomitric (via GitHub)" <gi...@apache.org>.
stevomitric commented on code in PR #45421:
URL: https://github.com/apache/spark/pull/45421#discussion_r1518576388


##########
common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java:
##########
@@ -410,7 +414,19 @@ public boolean endsWith(final UTF8String suffix, int collationId) {
     if (collationId == CollationFactory.LOWERCASE_COLLATION_ID) {
       return this.toLowerCase().endsWith(suffix.toLowerCase());
     }
-    return matchAt(suffix, numBytes - suffix.numBytes, collationId);
+    return collatedEndsWith(suffix, collationId);
+  }
+
+  private boolean collatedEndsWith(final UTF8String suffix, int collationId) {
+    if (suffix.numBytes == 0 || this.numBytes == 0) {
+      return suffix.numBytes==0;
+    }
+    if (suffix.numChars() > this.numChars()) {
+      return false;
+    }
+    return CollationFactory.getStringSearch(
+            this.substring(this.numChars()-suffix.numChars(), this.numChars()),

Review Comment:
   > identation is a bit weird here. 4 spaces should be more appropriate? Please align with the rest of file.
   
   Followed [Spacing and Indention rules](https://github.com/databricks/scala-style-guide?tab=readme-ov-file#spacing-and-indentation).
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47295] Added ICU StringSearch for 'startsWith' and 'endsWith' functions [spark]

Posted by "stevomitric (via GitHub)" <gi...@apache.org>.
stevomitric commented on code in PR #45421:
URL: https://github.com/apache/spark/pull/45421#discussion_r1517360690


##########
common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java:
##########
@@ -410,7 +412,9 @@ public boolean endsWith(final UTF8String suffix, int collationId) {
     if (collationId == CollationFactory.LOWERCASE_COLLATION_ID) {
       return this.toLowerCase().endsWith(suffix.toLowerCase());
     }
-    return matchAt(suffix, numBytes - suffix.numBytes, collationId);
+    if (suffix.numBytes == 0 || this.numBytes == 0) return suffix.numBytes==0;
+
+    return CollationFactory.getStringSearch(this, suffix, collationId).last()==this.numChars()-suffix.numChars();

Review Comment:
   Separated both `collatedEndsWith` and `collatedStartsWith`



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47295][SQL][COLLATION] Added ICU StringSearch for 'startsWith' and 'endsWith' functions [spark]

Posted by "uros-db (via GitHub)" <gi...@apache.org>.
uros-db commented on code in PR #45421:
URL: https://github.com/apache/spark/pull/45421#discussion_r1519302246


##########
common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java:
##########
@@ -396,7 +389,18 @@ public boolean startsWith(final UTF8String prefix, int collationId) {
     if (collationId == CollationFactory.LOWERCASE_COLLATION_ID) {
       return this.toLowerCase().startsWith(prefix.toLowerCase());
     }
-    return matchAt(prefix, 0, collationId);
+    return collatedStartsWith(prefix, collationId);
+  }
+
+  private boolean collatedStartsWith(final UTF8String prefix, int collationId) {
+    if (prefix.numBytes == 0 || this.numBytes == 0) {
+      return prefix.numBytes==0;

Review Comment:
   `return prefix.numBytes == 0;`



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47295][SQL][COLLATION] Added ICU StringSearch for 'startsWith' and 'endsWith' functions [spark]

Posted by "uros-db (via GitHub)" <gi...@apache.org>.
uros-db commented on code in PR #45421:
URL: https://github.com/apache/spark/pull/45421#discussion_r1518050657


##########
common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java:
##########
@@ -410,7 +414,19 @@ public boolean endsWith(final UTF8String suffix, int collationId) {
     if (collationId == CollationFactory.LOWERCASE_COLLATION_ID) {
       return this.toLowerCase().endsWith(suffix.toLowerCase());
     }
-    return matchAt(suffix, numBytes - suffix.numBytes, collationId);
+    return collatedEndsWith(suffix, collationId);
+  }
+
+  private boolean collatedEndsWith(final UTF8String suffix, int collationId) {
+    if (suffix.numBytes == 0 || this.numBytes == 0) {
+      return suffix.numBytes==0;
+    }
+    if (suffix.numChars() > this.numChars()) {
+      return false;
+    }
+    return CollationFactory.getStringSearch(
+            this.substring(this.numChars()-suffix.numChars(), this.numChars()),

Review Comment:
   +1
   
   this might be useful: https://github.com/databricks/scala-style-guide



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47295] Added ICU StringSearch for 'startsWith' and 'endsWith' functions [spark]

Posted by "uros-db (via GitHub)" <gi...@apache.org>.
uros-db commented on code in PR #45421:
URL: https://github.com/apache/spark/pull/45421#discussion_r1516389365


##########
common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java:
##########
@@ -384,27 +387,47 @@ public boolean startsWith(final UTF8String prefix) {
   }
 
   public boolean startsWith(final UTF8String prefix, int collationId) {
-    if (CollationFactory.fetchCollation(collationId).isBinaryCollation) {
+    if (collationId == CollationFactory.DEFAULT_COLLATION_ID) {
       return this.startsWith(prefix);
     }
     if (collationId == CollationFactory.LOWERCASE_COLLATION_ID) {
       return this.toLowerCase().startsWith(prefix.toLowerCase());
     }
-    return matchAt(prefix, 0, collationId);
+    if (prefix.numBytes == 0 || this.numBytes == 0) return prefix.numBytes==0;
+
+    String prefixString = StandardCharsets.UTF_8.decode(prefix.getByteBuffer()).toString(),
+            patternString = StandardCharsets.UTF_8.decode(this.getByteBuffer()).toString();
+
+    StringSearch search = new StringSearch(prefixString,

Review Comment:
   referring to this PR: https://github.com/apache/spark/pull/45382
   you can see that we shouldn't instantiate StringSearch objects outside of CollationFactory as well



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47295] Added ICU StringSearch for 'startsWith' and 'endsWith' functions [spark]

Posted by "uros-db (via GitHub)" <gi...@apache.org>.
uros-db commented on code in PR #45421:
URL: https://github.com/apache/spark/pull/45421#discussion_r1516391257


##########
common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java:
##########
@@ -31,6 +32,8 @@
 import com.esotericsoftware.kryo.io.Input;
 import com.esotericsoftware.kryo.io.Output;
 
+import com.ibm.icu.text.RuleBasedCollator;
+import com.ibm.icu.text.StringSearch;

Review Comment:
   (see comment below)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47295][SQL][COLLATION] Added ICU StringSearch for 'startsWith' and 'endsWith' functions [spark]

Posted by "uros-db (via GitHub)" <gi...@apache.org>.
uros-db commented on code in PR #45421:
URL: https://github.com/apache/spark/pull/45421#discussion_r1519387286


##########
common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java:
##########
@@ -378,13 +378,6 @@ public boolean matchAt(final UTF8String s, int pos) {
     return ByteArrayMethods.arrayEquals(base, offset + pos, s.base, s.offset, s.numBytes);
   }
 
-  private boolean matchAt(final UTF8String s, int pos, int collationId) {
-    if (s.numBytes + pos > numBytes || pos < 0) {
-      return false;
-    }
-    return this.substring(pos, pos + s.numBytes).semanticCompare(s, collationId) == 0;

Review Comment:
   We can, that's exactly what Stevo is doing - removing this code and replacing it with `CollationFactory.getStringSearch`. For more context: the old implementation was there before we introduced `StringSearch`.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47295][SQL] Added ICU StringSearch for the `startsWith` and `endsWith` functions [spark]

Posted by "MaxGekk (via GitHub)" <gi...@apache.org>.
MaxGekk closed pull request #45421: [SPARK-47295][SQL] Added ICU StringSearch for the `startsWith` and `endsWith` functions
URL: https://github.com/apache/spark/pull/45421


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47295] Added ICU StringSearch for 'startsWith' and 'endsWith' functions [spark]

Posted by "uros-db (via GitHub)" <gi...@apache.org>.
uros-db commented on code in PR #45421:
URL: https://github.com/apache/spark/pull/45421#discussion_r1516381847


##########
common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java:
##########
@@ -384,27 +387,47 @@ public boolean startsWith(final UTF8String prefix) {
   }
 
   public boolean startsWith(final UTF8String prefix, int collationId) {
-    if (CollationFactory.fetchCollation(collationId).isBinaryCollation) {
+    if (collationId == CollationFactory.DEFAULT_COLLATION_ID) {
       return this.startsWith(prefix);
     }
     if (collationId == CollationFactory.LOWERCASE_COLLATION_ID) {
       return this.toLowerCase().startsWith(prefix.toLowerCase());
     }
-    return matchAt(prefix, 0, collationId);
+    if (prefix.numBytes == 0 || this.numBytes == 0) return prefix.numBytes==0;
+
+    String prefixString = StandardCharsets.UTF_8.decode(prefix.getByteBuffer()).toString(),
+            patternString = StandardCharsets.UTF_8.decode(this.getByteBuffer()).toString();
+
+    StringSearch search = new StringSearch(prefixString,
+            new StringCharacterIterator(patternString),
+            (RuleBasedCollator) CollationFactory.fetchCollation(collationId).collator
+    );
+
+    return search.first()==0;
   }
 
   public boolean endsWith(final UTF8String suffix) {
     return matchAt(suffix, numBytes - suffix.numBytes);
   }
 
   public boolean endsWith(final UTF8String suffix, int collationId) {
-    if (CollationFactory.fetchCollation(collationId).isBinaryCollation) {
+    if (collationId == CollationFactory.DEFAULT_COLLATION_ID) {

Review Comment:
   (see comment above)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47295] Added ICU StringSearch for 'startsWith' and 'endsWith' functions [spark]

Posted by "uros-db (via GitHub)" <gi...@apache.org>.
uros-db commented on code in PR #45421:
URL: https://github.com/apache/spark/pull/45421#discussion_r1517226969


##########
common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java:
##########
@@ -396,7 +396,9 @@ public boolean startsWith(final UTF8String prefix, int collationId) {
     if (collationId == CollationFactory.LOWERCASE_COLLATION_ID) {
       return this.toLowerCase().startsWith(prefix.toLowerCase());
     }
-    return matchAt(prefix, 0, collationId);

Review Comment:
   if there's code in `UTF8String` that's no longer used, we should delete it (for example, the overloaded `matchAt`)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47295][SQL][COLLATION] Added ICU StringSearch for 'startsWith' and 'endsWith' functions [spark]

Posted by "dbatomic (via GitHub)" <gi...@apache.org>.
dbatomic commented on code in PR #45421:
URL: https://github.com/apache/spark/pull/45421#discussion_r1517976807


##########
common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java:
##########
@@ -410,7 +414,19 @@ public boolean endsWith(final UTF8String suffix, int collationId) {
     if (collationId == CollationFactory.LOWERCASE_COLLATION_ID) {
       return this.toLowerCase().endsWith(suffix.toLowerCase());
     }
-    return matchAt(suffix, numBytes - suffix.numBytes, collationId);
+    return collatedEndsWith(suffix, collationId);
+  }
+
+  private boolean collatedEndsWith(final UTF8String suffix, int collationId) {
+    if (suffix.numBytes == 0 || this.numBytes == 0) {
+      return suffix.numBytes==0;

Review Comment:
   same for other places.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org