You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2022/08/23 22:22:38 UTC

[GitHub] [spark] vitaliili-db opened a new pull request, #37631: [SPARK-40194][SQL] SPLIT function on empty regex should truncate trailing empty string.

vitaliili-db opened a new pull request, #37631:
URL: https://github.com/apache/spark/pull/37631

   
   <!--
   Thanks for sending a pull request!  Here are some tips for you:
     1. If this is your first time, please read our contributor guidelines: https://spark.apache.org/contributing.html
     2. Ensure you have added or run the appropriate tests for your PR: https://spark.apache.org/developer-tools.html
     3. If the PR is unfinished, add '[WIP]' in your PR title, e.g., '[WIP][SPARK-XXXX] Your PR title ...'.
     4. Be sure to keep the PR description updated to reflect all changes.
     5. Please write your PR title to summarize what this PR proposes.
     6. If possible, provide a concise example to reproduce the issue for a faster review.
     7. If you want to add a new configuration, please read the guideline first for naming configurations in
        'core/src/main/scala/org/apache/spark/internal/config/ConfigEntry.scala'.
     8. If you want to add or modify an error type or message, please read the guideline first in
        'core/src/main/resources/error/README.md'.
   -->
   
   ### What changes were proposed in this pull request?
   <!--
   Please clarify what changes you are proposing. The purpose of this section is to outline the changes and how this PR fixes the issue. 
   If possible, please consider writing useful notes for better and faster reviews in your PR. See the examples below.
     1. If you refactor some codes with changing classes, showing the class hierarchy will help reviewers.
     2. If you fix some SQL features, you can provide some references of other DBMSes.
     3. If there is design documentation, please add the link.
     4. If there is a discussion in the mailing list, please add the link.
   -->
   Special case for `split` function when `regex` parameter is empty. The result will split input string but avoid empty remainder/tail string. I.e. `split('hello', '')` will produce `['h', 'e', 'l', 'l', 'o']` instead of `['h', 'e', 'l', 'l', 'o', '']`. Old behavior is preserved when `limit` parameter is explicitly set to negative value - `split('hello', '', -1)`
   
   ### Why are the changes needed?
   <!--
   Please clarify why the changes are needed. For instance,
     1. If you propose a new API, clarify the use case for a new API.
     2. If you fix a bug, you can clarify why it is a bug.
   -->
   This is nice to have and matches intuitively expected behavior. 
   
   ### Does this PR introduce _any_ user-facing change?
   <!--
   Note that it means *any* user-facing change including all aspects such as the documentation fix.
   If yes, please clarify the previous behavior and the change this PR proposes - provide the console output, description and/or an example to show the behavior difference if possible.
   If possible, please also clarify if this is a user-facing change compared to the released Spark versions or within the unreleased branches such as master.
   If no, write 'No'.
   -->
   Yes. 
   - `split` function output changes when empty `regex` parameter is provided and `limit` parameter is not specified or set to 0. In this case a trailing empty string is ignored.
   
   ### How was this patch tested?
   <!--
   If tests were added, say they were added here. Please make sure to add some test cases that check the changes thoroughly including negative and positive cases if possible.
   If it was tested in a way different from regular unit tests, please clarify how you tested step by step, ideally copy and paste-able, so that other reviewers can test and check, and descendants can verify in the future.
   If tests were not added, please describe why they were not added and/or why it was difficult to add.
   If benchmark tests were added, please run the benchmarks in GitHub Actions for the consistent environment, and the instructions could accord to: https://spark.apache.org/developer-tools.html#github-workflow-benchmarks.
   -->
   Unit tests.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on a diff in pull request #37631: [SPARK-40194][SQL] SPLIT function on empty regex should truncate trailing empty string.

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on code in PR #37631:
URL: https://github.com/apache/spark/pull/37631#discussion_r954000072


##########
docs/sql-migration-guide.md:
##########
@@ -28,6 +28,7 @@ license: |
   - Since Spark 3.4, v1 database, table, permanent view and function identifier will include 'spark_catalog' as the catalog name if database is defined, e.g. a table identifier will be: `spark_catalog.default.t`. To restore the legacy behavior, set `spark.sql.legacy.v1IdentifierNoCatalog` to `true`.
   - Since Spark 3.4, when ANSI SQL mode(configuration `spark.sql.ansi.enabled`) is on, Spark SQL always returns NULL result on getting a map value with a non-existing key. In Spark 3.3 or earlier, there will be an error.
   - Since Spark 3.4, the SQL CLI `spark-sql` does not print the prefix `Error in query:` before the error message of `AnalysisException`.
+  - Since Spark 3.4, `split` function ignores trailing empty strings when `regex` parameter is empty and `limit` parameter is not specified or set to 0 or -1.

Review Comment:
   ```suggestion
     - Since Spark 3.4, `split` function ignores trailing empty strings when `regex` parameter is empty and `limit` parameter is not specified or set to a non-positive number.
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] vitaliili-db commented on pull request #37631: [SPARK-40194][SQL] SPLIT function on empty regex should truncate trailing empty string.

Posted by GitBox <gi...@apache.org>.
vitaliili-db commented on PR #37631:
URL: https://github.com/apache/spark/pull/37631#issuecomment-1230981751

   @cloud-fan thank you, that is a great suggestion. Done.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] vitaliili-db commented on a diff in pull request #37631: [SPARK-40194][SQL] SPLIT function on empty regex should truncate trailing empty string.

Posted by GitBox <gi...@apache.org>.
vitaliili-db commented on code in PR #37631:
URL: https://github.com/apache/spark/pull/37631#discussion_r953973254


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/regexpExpressions.scala:
##########
@@ -527,7 +527,7 @@ case class StringSplit(str: Expression, regex: Expression, limit: Expression)
   override def second: Expression = regex
   override def third: Expression = limit
 
-  def this(exp: Expression, regex: Expression) = this(exp, regex, Literal(-1));
+  def this(exp: Expression, regex: Expression) = this(exp, regex, Literal(0))

Review Comment:
   Yes, you are correct. This should not change. Reverted.



##########
common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java:
##########
@@ -1000,6 +1000,17 @@ public static UTF8String concatWs(UTF8String separator, UTF8String... inputs) {
   }
 
   public UTF8String[] split(UTF8String pattern, int limit) {
+    // This is a special case for converting string into array of symbols without a trailing empty
+    // string. E.g. `"hello".split("", 0) => ["h", "e", "l", "l", "o"].
+    // Note that negative limit will preserve a trailing empty string.
+    if (limit == 0 && numBytes() != 0 && pattern.numBytes() == 0) {
+      byte[] input = getBytes();
+      UTF8String[] result = new UTF8String[numBytes];
+      for (int i = 0; i < numBytes; i++) {

Review Comment:
   Great catch, fixed!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on a diff in pull request #37631: [SPARK-40194][SQL] SPLIT function on empty regex should truncate trailing empty string.

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on code in PR #37631:
URL: https://github.com/apache/spark/pull/37631#discussion_r953400215


##########
common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java:
##########
@@ -1000,6 +1000,17 @@ public static UTF8String concatWs(UTF8String separator, UTF8String... inputs) {
   }
 
   public UTF8String[] split(UTF8String pattern, int limit) {
+    // This is a special case for converting string into array of symbols without a trailing empty
+    // string. E.g. `"hello".split("", 0) => ["h", "e", "l", "l", "o"].
+    // Note that negative limit will preserve a trailing empty string.
+    if (limit == 0 && numBytes() != 0 && pattern.numBytes() == 0) {
+      byte[] input = getBytes();
+      UTF8String[] result = new UTF8String[numBytes];
+      for (int i = 0; i < numBytes; i++) {

Review Comment:
   are we sure it's one byte per character?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] vitaliili-db commented on pull request #37631: [SPARK-40194][SQL] SPLIT function on empty regex should truncate trailing empty string.

Posted by GitBox <gi...@apache.org>.
vitaliili-db commented on PR #37631:
URL: https://github.com/apache/spark/pull/37631#issuecomment-1224996770

   @cloud-fan 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on pull request #37631: [SPARK-40194][SQL] SPLIT function on empty regex should truncate trailing empty string.

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on PR #37631:
URL: https://github.com/apache/spark/pull/37631#issuecomment-1229682384

   ah, now I understand where the trailing empty string comes from, thanks! I agree to go with the "split and drop trailing empty string" behavior, but can we do it regardless of the limit parameter? I'd expect something like
   ```
   if (split == "") {
     str.toSeq.take(limit)
   }
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on pull request #37631: [SPARK-40194][SQL] SPLIT function on empty regex should truncate trailing empty string.

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on PR #37631:
URL: https://github.com/apache/spark/pull/37631#issuecomment-1226707879

   This seems a bit inconsistent. According to the function doc, `-1` means no limit, and it's confusing why no limit is different from a large enough limit (which means no limit as well).
   
   I'd consider trailing empty string as a bug, can we fix it in all the cases?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] vitaliili-db commented on a diff in pull request #37631: [SPARK-40194][SQL] SPLIT function on empty regex should truncate trailing empty string.

Posted by GitBox <gi...@apache.org>.
vitaliili-db commented on code in PR #37631:
URL: https://github.com/apache/spark/pull/37631#discussion_r958704810


##########
common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java:
##########
@@ -1000,6 +1000,22 @@ public static UTF8String concatWs(UTF8String separator, UTF8String... inputs) {
   }
 
   public UTF8String[] split(UTF8String pattern, int limit) {
+    // This is a special case for converting string into array of symbols without a trailing empty
+    // string. E.g. `"hello".split("", 0) => ["h", "e", "l", "l", "o"].
+    // Note that negative limit will preserve a trailing empty string.

Review Comment:
   Done, thank you for reviewing!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] vitaliili-db commented on pull request #37631: [SPARK-40194][SQL] SPLIT function on empty regex should truncate trailing empty string.

Posted by GitBox <gi...@apache.org>.
vitaliili-db commented on PR #37631:
URL: https://github.com/apache/spark/pull/37631#issuecomment-1228950512

   @cloud-fan trailing empty string is not actually a bug, it has consistent behavior in all systems, e.g. `split("aaAbbAccA", "A")` gives same result in all systems => `['aa', 'bb', 'cc', '']`.  So were are pretty consistent here. The only difference in absence of `limit` parameter (e.g. for migration purposes) is how empty delimiter/regex behaves. As mentioned we might choose either ignore regex (return original string), split and drop trailing empty string (this PR) or do nothing and return array with trailing empty string. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on pull request #37631: [SPARK-40194][SQL] SPLIT function on empty regex should truncate trailing empty string.

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on PR #37631:
URL: https://github.com/apache/spark/pull/37631#issuecomment-1225936870

   What should be the behavior of `split('hello', '', 100)`? Do other databases return a trailing empty string?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #37631: [SPARK-40194][SQL] SPLIT function on empty regex should truncate trailing empty string.

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on PR #37631:
URL: https://github.com/apache/spark/pull/37631#issuecomment-1225920329

   Can one of the admins verify this patch?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] vitaliili-db commented on pull request #37631: [SPARK-40194][SQL] SPLIT function on empty regex should truncate trailing empty string.

Posted by GitBox <gi...@apache.org>.
vitaliili-db commented on PR #37631:
URL: https://github.com/apache/spark/pull/37631#issuecomment-1226095179

   @cloud-fan other databases don't have `limit` parameter in `split` function. In addition a second parameter is a string in other systems as opposed to regex in Spark. Closest method we have is `StringSplitSQL` but we cannot replace it since it will be bigger breaking change for users who use `limit` parameter.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on a diff in pull request #37631: [SPARK-40194][SQL] SPLIT function on empty regex should truncate trailing empty string.

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on code in PR #37631:
URL: https://github.com/apache/spark/pull/37631#discussion_r957940422


##########
common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java:
##########
@@ -1000,6 +1000,22 @@ public static UTF8String concatWs(UTF8String separator, UTF8String... inputs) {
   }
 
   public UTF8String[] split(UTF8String pattern, int limit) {
+    // This is a special case for converting string into array of symbols without a trailing empty
+    // string. E.g. `"hello".split("", 0) => ["h", "e", "l", "l", "o"].
+    // Note that negative limit will preserve a trailing empty string.

Review Comment:
   We need to update this comment now.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan closed pull request #37631: [SPARK-40194][SQL] SPLIT function on empty regex should truncate trailing empty string.

Posted by GitBox <gi...@apache.org>.
cloud-fan closed pull request #37631: [SPARK-40194][SQL] SPLIT function on empty regex should truncate trailing empty string.
URL: https://github.com/apache/spark/pull/37631


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on pull request #37631: [SPARK-40194][SQL] SPLIT function on empty regex should truncate trailing empty string.

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on PR #37631:
URL: https://github.com/apache/spark/pull/37631#issuecomment-1232655094

   thanks, merging to master!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on a diff in pull request #37631: [SPARK-40194][SQL] SPLIT function on empty regex should truncate trailing empty string.

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on code in PR #37631:
URL: https://github.com/apache/spark/pull/37631#discussion_r953401126


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/regexpExpressions.scala:
##########
@@ -527,7 +527,7 @@ case class StringSplit(str: Expression, regex: Expression, limit: Expression)
   override def second: Expression = regex
   override def third: Expression = limit
 
-  def this(exp: Expression, regex: Expression) = this(exp, regex, Literal(-1));
+  def this(exp: Expression, regex: Expression) = this(exp, regex, Literal(0))

Review Comment:
   ideally 0 and -1 should be no difference according to the function doc. Do you use -1 as a legacy flag?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org