Posted to reviews@spark.apache.org by "dbatomic (via GitHub)" <gi...@apache.org> on 2023/12/01 14:39:35 UTC
[PR] [SPARK-46173] Skipping trimAll call during date parsing [spark]
dbatomic opened a new pull request, #44110:
URL: https://github.com/apache/spark/pull/44110
<!--
Thanks for sending a pull request! Here are some tips for you:
1. If this is your first time, please read our contributor guidelines: https://spark.apache.org/contributing.html
2. Ensure you have added or run the appropriate tests for your PR: https://spark.apache.org/developer-tools.html
3. If the PR is unfinished, add '[WIP]' in your PR title, e.g., '[WIP][SPARK-XXXX] Your PR title ...'.
4. Be sure to keep the PR description updated to reflect all changes.
5. Please write your PR title to summarize what this PR proposes.
6. If possible, provide a concise example to reproduce the issue for a faster review.
7. If you want to add a new configuration, please read the guideline first for naming configurations in
'core/src/main/scala/org/apache/spark/internal/config/ConfigEntry.scala'.
8. If you want to add or modify an error type or message, please read the guideline first in
'core/src/main/resources/error/README.md'.
-->
Link to [JIRA](https://issues.apache.org/jira/browse/SPARK-46173).
### What changes were proposed in this pull request?
<!--
Please clarify what changes you are proposing. The purpose of this section is to outline the changes and how this PR fixes the issue.
If possible, please consider writing useful notes for better and faster reviews in your PR. See the examples below.
1. If you refactor some codes with changing classes, showing the class hierarchy will help reviewers.
2. If you fix some SQL features, you can provide some references of other DBMSes.
3. If there is design documentation, please add the link.
4. If there is a discussion in the mailing list, please add the link.
-->
This PR is a response to a user complaint that string-to-date casts are too slow when the input string has many whitespace/ISO control characters as a prefix or suffix.
The current behaviour is that the StringToDate function first calls trimAll to remove any whitespace/ISO control characters. trimAll creates a new string, and we then work against that string. This is not really needed, since StringToDate can simply skip these characters and do all the processing in a single loop without extra allocations. The proposed fix does exactly that.
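The single-pass idea described above can be sketched as follows. This is a simplified, hypothetical stand-alone version for illustration only; the real implementation in SparkDateTimeUtils.stringToDate operates on UTF8String bytes, and the class and method names here are made up:

```java
// Simplified sketch of the single-pass approach: skip leading/trailing
// whitespace and ISO control characters by moving indices, instead of
// allocating a trimmed copy of the string up front.
public class SkipTrimSketch {
    static boolean skippable(char c) {
        return Character.isWhitespace(c) || Character.isISOControl(c);
    }

    /** Returns {start, end} bounds of the trimmed content, or null if none. */
    static int[] skipTrim(String s) {
        if (s == null) return null;
        int start = 0;
        while (start < s.length() && skippable(s.charAt(start))) start++;
        int end = s.length();
        while (end > start && skippable(s.charAt(end - 1))) end--;
        // Null, empty, or all-whitespace input yields null, matching the old
        // `s.trimAll().numBytes() == 0` check without the extra allocation.
        return (start >= end) ? null : new int[] {start, end};
    }
}
```

The parser can then read characters in the half-open range [start, end) directly, so no intermediate string is ever created.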
### Why are the changes needed?
<!--
Please clarify why the changes are needed. For instance,
1. If you propose a new API, clarify the use case for a new API.
2. If you fix a bug, you can clarify why it is a bug.
-->
These changes should drastically improve the edge case where the input string in a cast to date has many whitespace characters as a prefix/suffix.
### Does this PR introduce _any_ user-facing change?
<!--
Note that it means *any* user-facing change including all aspects such as the documentation fix.
If yes, please clarify the previous behavior and the change this PR proposes - provide the console output, description and/or an example to show the behavior difference if possible.
If possible, please also clarify if this is a user-facing change compared to the released Spark versions or within the unreleased branches such as master.
If no, write 'No'.
-->
No
### How was this patch tested?
<!--
If tests were added, say they were added here. Please make sure to add some test cases that check the changes thoroughly including negative and positive cases if possible.
If it was tested in a way different from regular unit tests, please clarify how you tested step by step, ideally copy and paste-able, so that other reviewers can test and check, and descendants can verify in the future.
If tests were not added, please describe why they were not added and/or why it was difficult to add.
If benchmark tests were added, please run the benchmarks in GitHub Actions for the consistent environment, and the instructions could accord to: https://spark.apache.org/developer-tools.html#github-workflow-benchmarks.
-->
Extended existing tests to cover both prefix and suffix cases with whitespace/control characters.
The change also includes a small unit benchmark for this particular case. Looking for feedback on whether we want to keep the benchmark, given that this is a rather esoteric edge case.
The benchmark measures the time needed to do 10k calls of the stringToDate function against input strings of the form <65k whitespace><correct_date><65k whitespace> and <correct_date><65k whitespace>. Here are the results:
| input example | prior to change | after the change |
| -------------- | ------------------ | ------------- |
| 65k whitespace prefix + 65k suffix | 3250ms | 484ms |
| 65k whitespace suffix | 1572ms | <1ms |
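A benchmark of this shape can be approximated with a trivial timing loop. The sketch below is illustrative only: the numbers in the table above come from the PR's own benchmark, and the padding size and iteration count here are assumptions:

```java
// Hypothetical micro-benchmark contrasting trim-then-parse (allocates a
// trimmed copy per call) with an index-based skip (no allocation).
public class TrimBench {
    static long timeMs(int iters, Runnable body) {
        long start = System.nanoTime();
        for (int i = 0; i < iters; i++) body.run();
        return (System.nanoTime() - start) / 1_000_000L;
    }

    public static void main(String[] args) {
        String pad = " ".repeat(65536);
        String padded = pad + "2015-03-18" + pad;
        // Each trim() call scans the padding and copies out a new String.
        long trimCopy = timeMs(1000, () -> padded.trim());
        // The skip only advances an index over the same prefix.
        long skip = timeMs(1000, () -> {
            int j = 0;
            while (j < padded.length() && Character.isWhitespace(padded.charAt(j))) j++;
        });
        System.out.println("trim copy: " + trimCopy + " ms, index skip: " + skip + " ms");
    }
}
```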
### Was this patch authored or co-authored using generative AI tooling?
<!--
If generative AI tooling has been used in the process of authoring this patch, please include the
phrase: 'Generated-by: ' followed by the name of the tool and its version.
If no, write 'No'.
Please refer to the [ASF Generative Tooling Guidance](https://www.apache.org/legal/generative-tooling.html) for details.
-->
No
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
Re: [PR] [SPARK-46173][SQL] Skipping trimAll call during date parsing [spark]
Posted by "beliefer (via GitHub)" <gi...@apache.org>.
beliefer commented on code in PR #44110:
URL: https://github.com/apache/spark/pull/44110#discussion_r1413764128
##########
sql/api/src/main/scala/org/apache/spark/sql/catalyst/util/SparkDateTimeUtils.scala:
##########
@@ -305,21 +305,32 @@ trait SparkDateTimeUtils {
(segment == 0 && digits >= 4 && digits <= maxDigitsYear) ||
(segment != 0 && digits > 0 && digits <= 2)
}
- if (s == null || s.trimAll().numBytes() == 0) {
+ if (s == null) {
return None
}
+
val segments: Array[Int] = Array[Int](1, 1, 1)
var sign = 1
var i = 0
var currentSegmentValue = 0
var currentSegmentDigits = 0
- val bytes = s.trimAll().getBytes
+ val bytes = s.getBytes
var j = 0
+
+ while (j < bytes.length && UTF8String.isWhitespaceOrISOControl(bytes(j))) {
+ j += 1;
+ }
Review Comment:
Please ignore the above comment.
Re: [PR] [SPARK-46173][SQL] Skipping trimAll call during date parsing [spark]
Posted by "dbatomic (via GitHub)" <gi...@apache.org>.
dbatomic commented on PR #44110:
URL: https://github.com/apache/spark/pull/44110#issuecomment-1845566488
> @dbatomic Does this suffer from the same issue?
>
> https://github.com/apache/spark/blob/b4d90dd2a622beda542a7ce1ee15af9f312f9724/sql/api/src/main/scala/org/apache/spark/sql/catalyst/util/SparkDateTimeUtils.scala#L429
>
> If so, could you optimize it too.
Sure, good catch. I will follow up with a PR that takes care of this one.
Re: [PR] [SPARK-46173][SQL] Skipping trimAll call during date parsing [spark]
Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan closed pull request #44110: [SPARK-46173][SQL] Skipping trimAll call during date parsing
URL: https://github.com/apache/spark/pull/44110
Re: [PR] [SPARK-46173][SQL] Skipping trimAll call during date parsing [spark]
Posted by "beliefer (via GitHub)" <gi...@apache.org>.
beliefer commented on code in PR #44110:
URL: https://github.com/apache/spark/pull/44110#discussion_r1418184528
##########
sql/api/src/main/scala/org/apache/spark/sql/catalyst/util/SparkDateTimeUtils.scala:
##########
@@ -305,21 +305,35 @@ trait SparkDateTimeUtils {
(segment == 0 && digits >= 4 && digits <= maxDigitsYear) ||
(segment != 0 && digits > 0 && digits <= 2)
}
- if (s == null || s.trimAll().numBytes() == 0) {
+ if (s == null) {
return None
}
+
val segments: Array[Int] = Array[Int](1, 1, 1)
var sign = 1
var i = 0
var currentSegmentValue = 0
var currentSegmentDigits = 0
- val bytes = s.trimAll().getBytes
+ val bytes = s.getBytes
var j = 0
+ var strEndTrimmed = bytes.length
+
+ while (j < bytes.length && UTF8String.isWhitespaceOrISOControl(bytes(j))) {
Review Comment:
No. My original suggestion was to create two variables, `strLen` and `strEndTrimmed`. The suggestion is unrelated to renaming `strEndTrimmed`. Of course, we can decide to use `bytes.length` directly instead of creating `strLen`.
Re: [PR] [SPARK-46173][SQL] Skipping trimAll call during date parsing [spark]
Posted by "beliefer (via GitHub)" <gi...@apache.org>.
beliefer commented on code in PR #44110:
URL: https://github.com/apache/spark/pull/44110#discussion_r1415543362
##########
sql/api/src/main/scala/org/apache/spark/sql/catalyst/util/SparkDateTimeUtils.scala:
##########
@@ -305,21 +305,41 @@ trait SparkDateTimeUtils {
(segment == 0 && digits >= 4 && digits <= maxDigitsYear) ||
(segment != 0 && digits > 0 && digits <= 2)
}
- if (s == null || s.trimAll().numBytes() == 0) {
+ if (s == null) {
return None
}
+
val segments: Array[Int] = Array[Int](1, 1, 1)
var sign = 1
var i = 0
var currentSegmentValue = 0
var currentSegmentDigits = 0
- val bytes = s.trimAll().getBytes
+ val bytes = s.getBytes
var j = 0
+
+ while (j < bytes.length && UTF8String.isWhitespaceOrISOControl(bytes(j))) {
+ j += 1;
+ }
+
+ if (j == bytes.length) {
+ return None;
+ }
+
+ var strlen = bytes.length
+ while (strlen > j && UTF8String.isWhitespaceOrISOControl(bytes(strlen - 1))) {
+ strlen -= 1;
+ }
+
+ if (j >= strlen)
+ {
+ return None
Review Comment:
Is this path reachable?
Re: [PR] [SPARK-46173][SQL] Skipping trimAll call during date parsing [spark]
Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on code in PR #44110:
URL: https://github.com/apache/spark/pull/44110#discussion_r1413748213
##########
sql/api/src/main/scala/org/apache/spark/sql/catalyst/util/SparkDateTimeUtils.scala:
##########
@@ -305,21 +305,32 @@ trait SparkDateTimeUtils {
(segment == 0 && digits >= 4 && digits <= maxDigitsYear) ||
(segment != 0 && digits > 0 && digits <= 2)
}
- if (s == null || s.trimAll().numBytes() == 0) {
+ if (s == null) {
return None
}
+
val segments: Array[Int] = Array[Int](1, 1, 1)
var sign = 1
var i = 0
var currentSegmentValue = 0
var currentSegmentDigits = 0
- val bytes = s.trimAll().getBytes
+ val bytes = s.getBytes
var j = 0
+
+ while (j < bytes.length && UTF8String.isWhitespaceOrISOControl(bytes(j))) {
+ j += 1;
+ }
+
+ if (j == bytes.length) {
+ return None;
+ }
+
if (bytes(j) == '-' || bytes(j) == '+') {
sign = if (bytes(j) == '-') -1 else 1
j += 1
}
- while (j < bytes.length && (i < 3 && !(bytes(j) == ' ' || bytes(j) == 'T'))) {
+ while (j < bytes.length &&
+ (i < 3 && !(UTF8String.isWhitespaceOrISOControl(bytes(j)) || bytes(j) == 'T'))) {
Review Comment:
oh is this a behavior change? now not only spaces, but also ISO controls are allowed in the middle.
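The behavior difference being flagged can be made concrete with a sketch of the two separator predicates. These are hypothetical helpers mirroring the old and new loop conditions, not the actual Spark code:

```java
// Sketch of the old vs. new conditions that end the date part of the string.
public class SeparatorChange {
    // Old condition: the date part ends only at ' ' or 'T'.
    static boolean isSeparatorOld(byte b) {
        return b == ' ' || b == 'T';
    }

    // New condition: any whitespace or ISO control character ends it too.
    static boolean isSeparatorNew(byte b) {
        return Character.isWhitespace(b) || Character.isISOControl(b) || b == 'T';
    }
}
```

Under this sketch, a control byte such as 0x03 (as in "2015-03-18\u0003...") is not a separator under the old rule but is under the new one, which is exactly the behavior change being discussed.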
Re: [PR] [SPARK-46173][SQL] Skipping trimAll call during date parsing [spark]
Posted by "dbatomic (via GitHub)" <gi...@apache.org>.
dbatomic commented on code in PR #44110:
URL: https://github.com/apache/spark/pull/44110#discussion_r1416987444
##########
sql/api/src/main/scala/org/apache/spark/sql/catalyst/util/SparkDateTimeUtils.scala:
##########
@@ -305,21 +305,35 @@ trait SparkDateTimeUtils {
(segment == 0 && digits >= 4 && digits <= maxDigitsYear) ||
(segment != 0 && digits > 0 && digits <= 2)
}
- if (s == null || s.trimAll().numBytes() == 0) {
+ if (s == null) {
return None
}
+
val segments: Array[Int] = Array[Int](1, 1, 1)
var sign = 1
var i = 0
var currentSegmentValue = 0
var currentSegmentDigits = 0
- val bytes = s.trimAll().getBytes
+ val bytes = s.getBytes
var j = 0
+ var strEndTrimmed = bytes.length
Review Comment:
Yeah, I also don't think that we need two variables. I can call it strLen or strLenTrimmed or anything along those lines, if that sounds better to you guys.
Re: [PR] [SPARK-46173][SQL] Skipping trimAll call during date parsing [spark]
Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on code in PR #44110:
URL: https://github.com/apache/spark/pull/44110#discussion_r1416657435
##########
sql/api/src/main/scala/org/apache/spark/sql/catalyst/util/SparkDateTimeUtils.scala:
##########
@@ -305,21 +305,35 @@ trait SparkDateTimeUtils {
(segment == 0 && digits >= 4 && digits <= maxDigitsYear) ||
(segment != 0 && digits > 0 && digits <= 2)
}
- if (s == null || s.trimAll().numBytes() == 0) {
+ if (s == null) {
return None
}
+
val segments: Array[Int] = Array[Int](1, 1, 1)
var sign = 1
var i = 0
var currentSegmentValue = 0
var currentSegmentDigits = 0
- val bytes = s.trimAll().getBytes
+ val bytes = s.getBytes
var j = 0
+ var strEndTrimmed = bytes.length
Review Comment:
`bytes.length` is very cheap and clear. It doesn't seem helpful to create a variable for it.
Re: [PR] [SPARK-46173][SQL] Skipping trimAll call during date parsing [spark]
Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on code in PR #44110:
URL: https://github.com/apache/spark/pull/44110#discussion_r1416657240
##########
sql/api/src/main/scala/org/apache/spark/sql/catalyst/util/SparkDateTimeUtils.scala:
##########
@@ -305,21 +305,35 @@ trait SparkDateTimeUtils {
(segment == 0 && digits >= 4 && digits <= maxDigitsYear) ||
(segment != 0 && digits > 0 && digits <= 2)
}
- if (s == null || s.trimAll().numBytes() == 0) {
+ if (s == null) {
return None
}
+
val segments: Array[Int] = Array[Int](1, 1, 1)
var sign = 1
var i = 0
var currentSegmentValue = 0
var currentSegmentDigits = 0
- val bytes = s.trimAll().getBytes
+ val bytes = s.getBytes
var j = 0
+ var strEndTrimmed = bytes.length
+
+ while (j < bytes.length && UTF8String.isWhitespaceOrISOControl(bytes(j))) {
Review Comment:
how?
Re: [PR] [SPARK-46173] Skipping trimAll call during date parsing [spark]
Posted by "MaxGekk (via GitHub)" <gi...@apache.org>.
MaxGekk commented on PR #44110:
URL: https://github.com/apache/spark/pull/44110#issuecomment-1836615208
@dbatomic Could you remove this Link to [JIRA](https://issues.apache.org/jira/browse/SPARK-46173) from the PR's description (the PR's title has it already), and add the `[SQL]` tag to the title.
Re: [PR] [SPARK-46173][SQL] Skipping trimAll call during date parsing [spark]
Posted by "beliefer (via GitHub)" <gi...@apache.org>.
beliefer commented on code in PR #44110:
URL: https://github.com/apache/spark/pull/44110#discussion_r1413737755
##########
sql/api/src/main/scala/org/apache/spark/sql/catalyst/util/SparkDateTimeUtils.scala:
##########
@@ -305,21 +305,32 @@ trait SparkDateTimeUtils {
(segment == 0 && digits >= 4 && digits <= maxDigitsYear) ||
(segment != 0 && digits > 0 && digits <= 2)
}
- if (s == null || s.trimAll().numBytes() == 0) {
+ if (s == null) {
return None
}
+
val segments: Array[Int] = Array[Int](1, 1, 1)
var sign = 1
var i = 0
var currentSegmentValue = 0
var currentSegmentDigits = 0
- val bytes = s.trimAll().getBytes
Review Comment:
We can reuse `trimLeft` here.
##########
sql/api/src/main/scala/org/apache/spark/sql/catalyst/util/SparkDateTimeUtils.scala:
##########
@@ -305,21 +305,32 @@ trait SparkDateTimeUtils {
(segment == 0 && digits >= 4 && digits <= maxDigitsYear) ||
(segment != 0 && digits > 0 && digits <= 2)
}
- if (s == null || s.trimAll().numBytes() == 0) {
+ if (s == null) {
return None
}
+
val segments: Array[Int] = Array[Int](1, 1, 1)
var sign = 1
var i = 0
var currentSegmentValue = 0
var currentSegmentDigits = 0
- val bytes = s.trimAll().getBytes
+ val bytes = s.getBytes
var j = 0
+
+ while (j < bytes.length && UTF8String.isWhitespaceOrISOControl(bytes(j))) {
+ j += 1;
+ }
Review Comment:
After using `trimLeft`, we can delete these three lines.
Re: [PR] [SPARK-46173][SQL] Skipping trimAll call during date parsing [spark]
Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on code in PR #44110:
URL: https://github.com/apache/spark/pull/44110#discussion_r1414260156
##########
sql/api/src/main/scala/org/apache/spark/sql/catalyst/util/SparkDateTimeUtils.scala:
##########
@@ -305,21 +305,32 @@ trait SparkDateTimeUtils {
(segment == 0 && digits >= 4 && digits <= maxDigitsYear) ||
(segment != 0 && digits > 0 && digits <= 2)
}
- if (s == null || s.trimAll().numBytes() == 0) {
+ if (s == null) {
return None
}
+
val segments: Array[Int] = Array[Int](1, 1, 1)
var sign = 1
var i = 0
var currentSegmentValue = 0
var currentSegmentDigits = 0
- val bytes = s.trimAll().getBytes
+ val bytes = s.getBytes
var j = 0
+
+ while (j < bytes.length && UTF8String.isWhitespaceOrISOControl(bytes(j))) {
+ j += 1;
+ }
+
+ if (j == bytes.length) {
+ return None;
+ }
+
if (bytes(j) == '-' || bytes(j) == '+') {
sign = if (bytes(j) == '-') -1 else 1
j += 1
}
- while (j < bytes.length && (i < 3 && !(bytes(j) == ' ' || bytes(j) == 'T'))) {
+ while (j < bytes.length &&
+ (i < 3 && !(UTF8String.isWhitespaceOrISOControl(bytes(j)) || bytes(j) == 'T'))) {
Review Comment:
Let's not change any behavior but only improve the perf. i.e. only allow ' ' and 'T' in the middle.
Re: [PR] [SPARK-46173] Skipping trimAll call during date parsing [spark]
Posted by "beliefer (via GitHub)" <gi...@apache.org>.
beliefer commented on code in PR #44110:
URL: https://github.com/apache/spark/pull/44110#discussion_r1412705078
##########
common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java:
##########
@@ -502,7 +502,7 @@ private UTF8String copyUTF8String(int start, int end) {
* Determines if the specified character (Unicode code point) is white space or an ISO control
* character according to Java.
*/
- private boolean isWhitespaceOrISOControl(int codePoint) {
+ public static boolean isWhitespaceOrISOControl(int codePoint) {
Review Comment:
Because the visibility changed, please move `isWhitespaceOrISOControl` above the constructor.
Re: [PR] [SPARK-46173][SQL] Skipping trimAll call during date parsing [spark]
Posted by "dbatomic (via GitHub)" <gi...@apache.org>.
dbatomic commented on PR #44110:
URL: https://github.com/apache/spark/pull/44110#issuecomment-1838267647
> > The change also includes a small unit benchmark for this particular case.
>
> I wonder of other benchmarks. Do you observe perf regressions? I am asking just in case.
I just ran Parsing benchmarks locally. Here are the results:
With trimming changes:
func_name: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
to timestamp str wholestage on 85 87 2 11.7 85.2 1.1X
to_timestamp wholestage on 421 423 3 2.4 420.7 1.0X
to_unix_timestamp wholestage on 418 423 4 2.4 417.9 1.0X
to date str wholestage on 98 101 1 10.2 98.3 0.9X
to_date wholestage on 612 616 4 1.6 611.8 1.0X
to_date_with_trimming 758 763 7 0.0 75792262.5 1.0X
to_date_with_trimming_only_suffix 0 1 0 0.0 44237.5 1713.3X
--
Without my change
func_name: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
to timestamp str wholestage on 82 83 1 12.2 82.1 1.2X
to_timestamp wholestage on 418 421 2 2.4 418.1 0.9X
to_unix_timestamp wholestage on 414 416 2 2.4 413.9 0.9X
to date str wholestage on 101 103 2 9.9 101.4 0.9X
to_date wholestage on 620 624 4 1.6 620.3 1.0X
TLDR - no observable change. On some functions it is <1% worse and elsewhere it is 1% better, but again, these are just runs on my local machine without any real isolation, so a tiny diff is expected.
Re: [PR] [SPARK-46173][SQL] Skipping trimAll call during date parsing [spark]
Posted by "dbatomic (via GitHub)" <gi...@apache.org>.
dbatomic commented on code in PR #44110:
URL: https://github.com/apache/spark/pull/44110#discussion_r1413761230
##########
sql/api/src/main/scala/org/apache/spark/sql/catalyst/util/SparkDateTimeUtils.scala:
##########
@@ -305,21 +305,32 @@ trait SparkDateTimeUtils {
(segment == 0 && digits >= 4 && digits <= maxDigitsYear) ||
(segment != 0 && digits > 0 && digits <= 2)
}
- if (s == null || s.trimAll().numBytes() == 0) {
+ if (s == null) {
return None
}
+
val segments: Array[Int] = Array[Int](1, 1, 1)
var sign = 1
var i = 0
var currentSegmentValue = 0
var currentSegmentDigits = 0
- val bytes = s.trimAll().getBytes
+ val bytes = s.getBytes
var j = 0
+
+ while (j < bytes.length && UTF8String.isWhitespaceOrISOControl(bytes(j))) {
+ j += 1;
+ }
+
+ if (j == bytes.length) {
+ return None;
+ }
+
if (bytes(j) == '-' || bytes(j) == '+') {
sign = if (bytes(j) == '-') -1 else 1
j += 1
}
- while (j < bytes.length && (i < 3 && !(bytes(j) == ' ' || bytes(j) == 'T'))) {
+ while (j < bytes.length &&
+ (i < 3 && !(UTF8String.isWhitespaceOrISOControl(bytes(j)) || bytes(j) == 'T'))) {
Review Comment:
That's right. E.g. "2015-03-18\u0003123142" will be correctly parsed with this change.
I feel like we need to identify the proper behaviour here, since the current situation is pretty chaotic.
As of this change we have the following rules:
1) The string can have any number of whitespace or ISO control characters at the beginning or end. All of these will be ignored.
2) The date and time parts can be split by either a whitespace, an ISO control character, or the 'T' char.
If we feel this is not ok (date and time should be separated only by ' ' and 'T'), I can think of something to enforce that.
Re: [PR] [SPARK-46173] Skipping trimAll call during date parsing [spark]
Posted by "dbatomic (via GitHub)" <gi...@apache.org>.
dbatomic commented on PR #44110:
URL: https://github.com/apache/spark/pull/44110#issuecomment-1836239754
@MaxGekk -> Looking for your feedback on this one.
Re: [PR] [SPARK-46173][SQL] Skipping trimAll call during date parsing [spark]
Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on PR #44110:
URL: https://github.com/apache/spark/pull/44110#issuecomment-1843090919
The docker integration test failure is unrelated. merging to master, thanks!
Re: [PR] [SPARK-46173][SQL] Skipping trimAll call during date parsing [spark]
Posted by "MaxGekk (via GitHub)" <gi...@apache.org>.
MaxGekk commented on PR #44110:
URL: https://github.com/apache/spark/pull/44110#issuecomment-1845496117
@dbatomic Does this suffer from the same issue?
https://github.com/apache/spark/blob/b4d90dd2a622beda542a7ce1ee15af9f312f9724/sql/api/src/main/scala/org/apache/spark/sql/catalyst/util/SparkDateTimeUtils.scala#L429
If so, could you optimize it too.
Re: [PR] [SPARK-46173][SQL] Skipping trimAll call during date parsing [spark]
Posted by "MaxGekk (via GitHub)" <gi...@apache.org>.
MaxGekk commented on code in PR #44110:
URL: https://github.com/apache/spark/pull/44110#discussion_r1413582904
##########
sql/api/src/main/scala/org/apache/spark/sql/catalyst/util/SparkDateTimeUtils.scala:
##########
@@ -305,21 +305,28 @@ trait SparkDateTimeUtils {
(segment == 0 && digits >= 4 && digits <= maxDigitsYear) ||
(segment != 0 && digits > 0 && digits <= 2)
}
- if (s == null || s.trimAll().numBytes() == 0) {
+ if (s == null) {
return None
}
+
val segments: Array[Int] = Array[Int](1, 1, 1)
var sign = 1
var i = 0
var currentSegmentValue = 0
var currentSegmentDigits = 0
- val bytes = s.trimAll().getBytes
+ val bytes = s.getBytes
var j = 0
- if (bytes(j) == '-' || bytes(j) == '+') {
+
+ while (j < bytes.length && UTF8String.isWhitespaceOrISOControl(bytes(j))) {
+ j += 1;
+ }
+
+ if (j < bytes.length && (bytes(j) == '-' || bytes(j) == '+')) {
sign = if (bytes(j) == '-') -1 else 1
j += 1
}
- while (j < bytes.length && (i < 3 && !(bytes(j) == ' ' || bytes(j) == 'T'))) {
+ while ( j < bytes.length &&
Review Comment:
Just in case, here is the style guide: https://github.com/databricks/scala-style-guide
Re: [PR] [SPARK-46173][SQL] Skipping trimAll call during date parsing [spark]
Posted by "dbatomic (via GitHub)" <gi...@apache.org>.
dbatomic commented on PR #44110:
URL: https://github.com/apache/spark/pull/44110#issuecomment-1838423819
> > Looking for feedback if we want to keep the benchmark, given that this is a rather esoteric edge case.
>
> I am ok to exclude the benchmark for the pretty specific case from the PR.
I decided to exclude the benchmark - my thinking is that there is no point in having a benchmark for such an edge case.
Re: [PR] [SPARK-46173][SQL] Skipping trimAll call during date parsing [spark]
Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on code in PR #44110:
URL: https://github.com/apache/spark/pull/44110#discussion_r1413750468
##########
sql/api/src/main/scala/org/apache/spark/sql/catalyst/util/SparkDateTimeUtils.scala:
##########
@@ -305,21 +305,32 @@ trait SparkDateTimeUtils {
(segment == 0 && digits >= 4 && digits <= maxDigitsYear) ||
(segment != 0 && digits > 0 && digits <= 2)
}
- if (s == null || s.trimAll().numBytes() == 0) {
+ if (s == null) {
return None
}
+
val segments: Array[Int] = Array[Int](1, 1, 1)
var sign = 1
var i = 0
var currentSegmentValue = 0
var currentSegmentDigits = 0
- val bytes = s.trimAll().getBytes
+ val bytes = s.getBytes
var j = 0
+
+ while (j < bytes.length && UTF8String.isWhitespaceOrISOControl(bytes(j))) {
+ j += 1;
+ }
+
+ if (j == bytes.length) {
+ return None;
+ }
+
if (bytes(j) == '-' || bytes(j) == '+') {
sign = if (bytes(j) == '-') -1 else 1
j += 1
}
- while (j < bytes.length && (i < 3 && !(bytes(j) == ' ' || bytes(j) == 'T'))) {
+ while (j < bytes.length &&
+ (i < 3 && !(UTF8String.isWhitespaceOrISOControl(bytes(j)) || bytes(j) == 'T'))) {
Review Comment:
Maybe we should calculate the end offset (not always `bytes.length`) at the beginning, by excluding white spaces and ISO controls from the end.
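This end-offset idea can be sketched as follows. It is a minimal standalone sketch, not Spark's actual code, and `isWsOrCtrl` is a simplified stand-in for `UTF8String.isWhitespaceOrISOControl`:

```scala
// Simplified stand-in for UTF8String.isWhitespaceOrISOControl.
def isWsOrCtrl(b: Byte): Boolean =
  Character.isWhitespace(b.toChar) || Character.isISOControl(b.toChar)

// Returns (start, end) so that indices start until end cover the trimmed
// view, computed with index arithmetic only - no copy of the input is made.
def trimmedRange(bytes: Array[Byte]): (Int, Int) = {
  var start = 0
  while (start < bytes.length && isWsOrCtrl(bytes(start))) start += 1
  var end = bytes.length
  while (end > start && isWsOrCtrl(bytes(end - 1))) end -= 1
  (start, end)
}
```

With both offsets computed once up front, all later loops can be bounded by `end` instead of `bytes.length`, which is the point of the suggestion: the whitespace check leaves the hot loop entirely.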
Re: [PR] [SPARK-46173][SQL] Skipping trimAll call during date parsing [spark]
Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on code in PR #44110:
URL: https://github.com/apache/spark/pull/44110#discussion_r1413744707
##########
sql/api/src/main/scala/org/apache/spark/sql/catalyst/util/SparkDateTimeUtils.scala:
##########
@@ -305,21 +305,32 @@ trait SparkDateTimeUtils {
(segment == 0 && digits >= 4 && digits <= maxDigitsYear) ||
(segment != 0 && digits > 0 && digits <= 2)
}
- if (s == null || s.trimAll().numBytes() == 0) {
+ if (s == null) {
return None
}
+
val segments: Array[Int] = Array[Int](1, 1, 1)
var sign = 1
var i = 0
var currentSegmentValue = 0
var currentSegmentDigits = 0
- val bytes = s.trimAll().getBytes
Review Comment:
According to the PR description, we don't want to use these trim methods to avoid string copy.
Re: [PR] [SPARK-46173][SQL] Skipping trimAll call during date parsing [spark]
Posted by "MaxGekk (via GitHub)" <gi...@apache.org>.
MaxGekk commented on code in PR #44110:
URL: https://github.com/apache/spark/pull/44110#discussion_r1413574114
##########
sql/api/src/main/scala/org/apache/spark/sql/catalyst/util/SparkDateTimeUtils.scala:
##########
@@ -305,21 +305,28 @@ trait SparkDateTimeUtils {
(segment == 0 && digits >= 4 && digits <= maxDigitsYear) ||
(segment != 0 && digits > 0 && digits <= 2)
}
- if (s == null || s.trimAll().numBytes() == 0) {
+ if (s == null) {
return None
}
+
val segments: Array[Int] = Array[Int](1, 1, 1)
var sign = 1
var i = 0
var currentSegmentValue = 0
var currentSegmentDigits = 0
- val bytes = s.trimAll().getBytes
+ val bytes = s.getBytes
var j = 0
- if (bytes(j) == '-' || bytes(j) == '+') {
+
+ while (j < bytes.length && UTF8String.isWhitespaceOrISOControl(bytes(j))) {
+ j += 1;
+ }
+
+ if (j < bytes.length && (bytes(j) == '-' || bytes(j) == '+')) {
sign = if (bytes(j) == '-') -1 else 1
j += 1
}
- while (j < bytes.length && (i < 3 && !(bytes(j) == ' ' || bytes(j) == 'T'))) {
+ while ( j < bytes.length &&
+ (i < 3 && !(UTF8String.isWhitespaceOrISOControl(bytes(j)) || bytes(j) == 'T'))) {
Review Comment:
ok, I see. Let's preserve the behaviour.
Re: [PR] [SPARK-46173][SQL] Skipping trimAll call during date parsing [spark]
Posted by "dbatomic (via GitHub)" <gi...@apache.org>.
dbatomic commented on code in PR #44110:
URL: https://github.com/apache/spark/pull/44110#discussion_r1413561777
##########
sql/api/src/main/scala/org/apache/spark/sql/catalyst/util/SparkDateTimeUtils.scala:
##########
@@ -305,21 +305,28 @@ trait SparkDateTimeUtils {
(segment == 0 && digits >= 4 && digits <= maxDigitsYear) ||
(segment != 0 && digits > 0 && digits <= 2)
}
- if (s == null || s.trimAll().numBytes() == 0) {
+ if (s == null) {
return None
}
+
val segments: Array[Int] = Array[Int](1, 1, 1)
var sign = 1
var i = 0
var currentSegmentValue = 0
var currentSegmentDigits = 0
- val bytes = s.trimAll().getBytes
+ val bytes = s.getBytes
var j = 0
- if (bytes(j) == '-' || bytes(j) == '+') {
+
+ while (j < bytes.length && UTF8String.isWhitespaceOrISOControl(bytes(j))) {
+ j += 1;
+ }
+
+ if (j < bytes.length && (bytes(j) == '-' || bytes(j) == '+')) {
sign = if (bytes(j) == '-') -1 else 1
j += 1
}
- while (j < bytes.length && (i < 3 && !(bytes(j) == ' ' || bytes(j) == 'T'))) {
+ while ( j < bytes.length &&
+ (i < 3 && !(UTF8String.isWhitespaceOrISOControl(bytes(j)) || bytes(j) == 'T'))) {
Review Comment:
I had a conversation on this with @cloud-fan last week. The thing is that, for example, the following string will be parsed just fine: `toDate("2015-01-28 someRandomStuffAtTheEnd")`. I proposed to change this behaviour, but @cloud-fan pushed back and noted that it is too late to change this functionality and that this bug is more of a feature at this point :)
Hence, we can't just trim trailing whitespace and call it done, since there might be whitespace in the middle.
Also @cloud-fan , if we are sure that we want to keep this behaviour I would like to at least add a test that verifies that `toDate("<correct_date> <some gibberish>")` will be correctly parsed.
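The behaviour under discussion can be illustrated with a small self-contained sketch. This is not Spark's implementation - it only mimics the shape of the segment loop in `SparkDateTimeUtils` to show why a full date followed by arbitrary text still parses: the loop stops at the first space or `'T'` and never inspects the remainder.

```scala
// Minimal sketch (not Spark's actual code) of the date segment loop.
// Returns Some((year, month, day)) when the prefix is a well-formed date.
def sketchToDate(s: String): Option[(Int, Int, Int)] = {
  val bytes = s.getBytes
  var j = 0
  // Skip leading whitespace, as the PR does with an index instead of trimAll.
  while (j < bytes.length && Character.isWhitespace(bytes(j).toChar)) j += 1
  val segments = Array(1, 1, 1)
  var i = 0
  var value = 0
  var digits = 0
  // The loop exits at the first ' ' or 'T', ignoring everything after it.
  while (j < bytes.length && i < 3 && bytes(j) != ' ' && bytes(j) != 'T') {
    val b = bytes(j)
    if (b == '-') { segments(i) = value; value = 0; digits = 0; i += 1 }
    else if (b >= '0' && b <= '9') { value = value * 10 + (b - '0'); digits += 1 }
    else return None
    j += 1
  }
  if (digits == 0) return None
  segments(i) = value
  Some((segments(0), segments(1), segments(2)))
}
```

A trailing gibberish suffix is therefore accepted, e.g. `sketchToDate("2015-01-28 someRandomStuffAtTheEnd")` yields `Some((2015, 1, 28))`, which is exactly the behaviour a test should pin down.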
Re: [PR] [SPARK-46173][SQL] Skipping trimAll call during date parsing [spark]
Posted by "dbatomic (via GitHub)" <gi...@apache.org>.
dbatomic commented on code in PR #44110:
URL: https://github.com/apache/spark/pull/44110#discussion_r1413654244
##########
sql/api/src/main/scala/org/apache/spark/sql/catalyst/util/SparkDateTimeUtils.scala:
##########
@@ -305,21 +305,28 @@ trait SparkDateTimeUtils {
(segment == 0 && digits >= 4 && digits <= maxDigitsYear) ||
(segment != 0 && digits > 0 && digits <= 2)
}
- if (s == null || s.trimAll().numBytes() == 0) {
Review Comment:
done
Re: [PR] [SPARK-46173][SQL] Skipping trimAll call during date parsing [spark]
Posted by "beliefer (via GitHub)" <gi...@apache.org>.
beliefer commented on code in PR #44110:
URL: https://github.com/apache/spark/pull/44110#discussion_r1417100176
##########
sql/api/src/main/scala/org/apache/spark/sql/catalyst/util/SparkDateTimeUtils.scala:
##########
@@ -305,21 +305,35 @@ trait SparkDateTimeUtils {
(segment == 0 && digits >= 4 && digits <= maxDigitsYear) ||
(segment != 0 && digits > 0 && digits <= 2)
}
- if (s == null || s.trimAll().numBytes() == 0) {
+ if (s == null) {
return None
}
+
val segments: Array[Int] = Array[Int](1, 1, 1)
var sign = 1
var i = 0
var currentSegmentValue = 0
var currentSegmentDigits = 0
- val bytes = s.trimAll().getBytes
+ val bytes = s.getBytes
var j = 0
+ var strEndTrimmed = bytes.length
Review Comment:
I'm OK keeping it unchanged here too.
Yeah, `bytes.length` is cheap.
Re: [PR] [SPARK-46173][SQL] Skipping trimAll call during date parsing [spark]
Posted by "beliefer (via GitHub)" <gi...@apache.org>.
beliefer commented on code in PR #44110:
URL: https://github.com/apache/spark/pull/44110#discussion_r1413760894
##########
sql/api/src/main/scala/org/apache/spark/sql/catalyst/util/SparkDateTimeUtils.scala:
##########
@@ -305,21 +305,32 @@ trait SparkDateTimeUtils {
(segment == 0 && digits >= 4 && digits <= maxDigitsYear) ||
(segment != 0 && digits > 0 && digits <= 2)
}
- if (s == null || s.trimAll().numBytes() == 0) {
+ if (s == null) {
return None
}
+
val segments: Array[Int] = Array[Int](1, 1, 1)
var sign = 1
var i = 0
var currentSegmentValue = 0
var currentSegmentDigits = 0
- val bytes = s.trimAll().getBytes
Review Comment:
Got it.
Re: [PR] [SPARK-46173] Skipping trimAll call during date parsing [spark]
Posted by "MaxGekk (via GitHub)" <gi...@apache.org>.
MaxGekk commented on code in PR #44110:
URL: https://github.com/apache/spark/pull/44110#discussion_r1412447410
##########
sql/api/src/main/scala/org/apache/spark/sql/catalyst/util/SparkDateTimeUtils.scala:
##########
@@ -305,21 +305,28 @@ trait SparkDateTimeUtils {
(segment == 0 && digits >= 4 && digits <= maxDigitsYear) ||
(segment != 0 && digits > 0 && digits <= 2)
}
- if (s == null || s.trimAll().numBytes() == 0) {
Review Comment:
So, actually we could still have the optimization if you add an `if` check afterwards:
```scala
while (j < bytes.length && UTF8String.isWhitespaceOrISOControl(bytes(j))) {
j += 1;
}
if (j == bytes.length) {
return None;
}
```
##########
sql/api/src/main/scala/org/apache/spark/sql/catalyst/util/SparkDateTimeUtils.scala:
##########
@@ -305,21 +305,28 @@ trait SparkDateTimeUtils {
(segment == 0 && digits >= 4 && digits <= maxDigitsYear) ||
(segment != 0 && digits > 0 && digits <= 2)
}
- if (s == null || s.trimAll().numBytes() == 0) {
+ if (s == null) {
return None
}
+
val segments: Array[Int] = Array[Int](1, 1, 1)
var sign = 1
var i = 0
var currentSegmentValue = 0
var currentSegmentDigits = 0
- val bytes = s.trimAll().getBytes
+ val bytes = s.getBytes
var j = 0
- if (bytes(j) == '-' || bytes(j) == '+') {
+
+ while (j < bytes.length && UTF8String.isWhitespaceOrISOControl(bytes(j))) {
+ j += 1;
+ }
+
+ if (j < bytes.length && (bytes(j) == '-' || bytes(j) == '+')) {
sign = if (bytes(j) == '-') -1 else 1
j += 1
}
- while (j < bytes.length && (i < 3 && !(bytes(j) == ' ' || bytes(j) == 'T'))) {
+ while ( j < bytes.length &&
Review Comment:
Could you remove the gap and leave this part as it was:
```suggestion
while (j < bytes.length &&
```
Re: [PR] [SPARK-46173][SQL] Skipping trimAll call during date parsing [spark]
Posted by "dbatomic (via GitHub)" <gi...@apache.org>.
dbatomic commented on code in PR #44110:
URL: https://github.com/apache/spark/pull/44110#discussion_r1417024325
##########
sql/api/src/main/scala/org/apache/spark/sql/catalyst/util/SparkDateTimeUtils.scala:
##########
@@ -305,21 +305,35 @@ trait SparkDateTimeUtils {
(segment == 0 && digits >= 4 && digits <= maxDigitsYear) ||
(segment != 0 && digits > 0 && digits <= 2)
}
- if (s == null || s.trimAll().numBytes() == 0) {
+ if (s == null) {
return None
}
+
val segments: Array[Int] = Array[Int](1, 1, 1)
var sign = 1
var i = 0
var currentSegmentValue = 0
var currentSegmentDigits = 0
- val bytes = s.trimAll().getBytes
+ val bytes = s.getBytes
var j = 0
+ var strEndTrimmed = bytes.length
Review Comment:
Btw, we don't need strLen/bytes.length at all after we get the strEndTrimmed var. We don't care about anything past strEndTrimmed in this function.
Re: [PR] [SPARK-46173][SQL] Skipping trimAll call during date parsing [spark]
Posted by "beliefer (via GitHub)" <gi...@apache.org>.
beliefer commented on code in PR #44110:
URL: https://github.com/apache/spark/pull/44110#discussion_r1417096351
##########
sql/api/src/main/scala/org/apache/spark/sql/catalyst/util/SparkDateTimeUtils.scala:
##########
@@ -305,21 +305,35 @@ trait SparkDateTimeUtils {
(segment == 0 && digits >= 4 && digits <= maxDigitsYear) ||
(segment != 0 && digits > 0 && digits <= 2)
}
- if (s == null || s.trimAll().numBytes() == 0) {
+ if (s == null) {
return None
}
+
val segments: Array[Int] = Array[Int](1, 1, 1)
var sign = 1
var i = 0
var currentSegmentValue = 0
var currentSegmentDigits = 0
- val bytes = s.trimAll().getBytes
+ val bytes = s.getBytes
var j = 0
+ var strEndTrimmed = bytes.length
+
+ while (j < bytes.length && UTF8String.isWhitespaceOrISOControl(bytes(j))) {
Review Comment:
Please see
https://github.com/apache/spark/pull/44110#discussion_r1416537457
What I mean is to reuse `strLen`.
Re: [PR] [SPARK-46173][SQL] Skipping trimAll call during date parsing [spark]
Posted by "dbatomic (via GitHub)" <gi...@apache.org>.
dbatomic commented on code in PR #44110:
URL: https://github.com/apache/spark/pull/44110#discussion_r1415790402
##########
sql/api/src/main/scala/org/apache/spark/sql/catalyst/util/SparkDateTimeUtils.scala:
##########
@@ -305,21 +305,41 @@ trait SparkDateTimeUtils {
(segment == 0 && digits >= 4 && digits <= maxDigitsYear) ||
(segment != 0 && digits > 0 && digits <= 2)
}
- if (s == null || s.trimAll().numBytes() == 0) {
+ if (s == null) {
return None
}
+
val segments: Array[Int] = Array[Int](1, 1, 1)
var sign = 1
var i = 0
var currentSegmentValue = 0
var currentSegmentDigits = 0
- val bytes = s.trimAll().getBytes
+ val bytes = s.getBytes
var j = 0
+
+ while (j < bytes.length && UTF8String.isWhitespaceOrISOControl(bytes(j))) {
+ j += 1;
+ }
+
+ if (j == bytes.length) {
+ return None;
+ }
+
+ var strlen = bytes.length
+ while (strlen > j && UTF8String.isWhitespaceOrISOControl(bytes(strlen - 1))) {
+ strlen -= 1;
+ }
+
+ if (j >= strlen)
+ {
+ return None
Review Comment:
good catch, thanks! Removed.
Re: [PR] [SPARK-46173][SQL] Skipping trimAll call during date parsing [spark]
Posted by "beliefer (via GitHub)" <gi...@apache.org>.
beliefer commented on code in PR #44110:
URL: https://github.com/apache/spark/pull/44110#discussion_r1416536094
##########
sql/api/src/main/scala/org/apache/spark/sql/catalyst/util/SparkDateTimeUtils.scala:
##########
@@ -305,21 +305,35 @@ trait SparkDateTimeUtils {
(segment == 0 && digits >= 4 && digits <= maxDigitsYear) ||
(segment != 0 && digits > 0 && digits <= 2)
}
- if (s == null || s.trimAll().numBytes() == 0) {
+ if (s == null) {
return None
}
+
val segments: Array[Int] = Array[Int](1, 1, 1)
var sign = 1
var i = 0
var currentSegmentValue = 0
var currentSegmentDigits = 0
- val bytes = s.trimAll().getBytes
+ val bytes = s.getBytes
var j = 0
+ var strEndTrimmed = bytes.length
+
+ while (j < bytes.length && UTF8String.isWhitespaceOrISOControl(bytes(j))) {
+ j += 1;
+ }
+ if (j == bytes.length) {
Review Comment:
ditto.
##########
sql/api/src/main/scala/org/apache/spark/sql/catalyst/util/SparkDateTimeUtils.scala:
##########
@@ -305,21 +305,35 @@ trait SparkDateTimeUtils {
(segment == 0 && digits >= 4 && digits <= maxDigitsYear) ||
(segment != 0 && digits > 0 && digits <= 2)
}
- if (s == null || s.trimAll().numBytes() == 0) {
+ if (s == null) {
return None
}
+
val segments: Array[Int] = Array[Int](1, 1, 1)
var sign = 1
var i = 0
var currentSegmentValue = 0
var currentSegmentDigits = 0
- val bytes = s.trimAll().getBytes
+ val bytes = s.getBytes
var j = 0
+ var strEndTrimmed = bytes.length
+
+ while (j < bytes.length && UTF8String.isWhitespaceOrISOControl(bytes(j))) {
Review Comment:
Why not reuse `strEndTrimmed` here?
##########
sql/api/src/main/scala/org/apache/spark/sql/catalyst/util/SparkDateTimeUtils.scala:
##########
@@ -305,21 +305,35 @@ trait SparkDateTimeUtils {
(segment == 0 && digits >= 4 && digits <= maxDigitsYear) ||
(segment != 0 && digits > 0 && digits <= 2)
}
- if (s == null || s.trimAll().numBytes() == 0) {
+ if (s == null) {
return None
}
+
val segments: Array[Int] = Array[Int](1, 1, 1)
var sign = 1
var i = 0
var currentSegmentValue = 0
var currentSegmentDigits = 0
- val bytes = s.trimAll().getBytes
+ val bytes = s.getBytes
var j = 0
+ var strEndTrimmed = bytes.length
Review Comment:
I like the previous name; it was more universal, e.g. `strLen` or `len`.
##########
sql/api/src/main/scala/org/apache/spark/sql/catalyst/util/SparkDateTimeUtils.scala:
##########
@@ -305,21 +305,35 @@ trait SparkDateTimeUtils {
(segment == 0 && digits >= 4 && digits <= maxDigitsYear) ||
(segment != 0 && digits > 0 && digits <= 2)
}
- if (s == null || s.trimAll().numBytes() == 0) {
+ if (s == null) {
return None
}
+
val segments: Array[Int] = Array[Int](1, 1, 1)
var sign = 1
var i = 0
var currentSegmentValue = 0
var currentSegmentDigits = 0
- val bytes = s.trimAll().getBytes
+ val bytes = s.getBytes
var j = 0
+ var strEndTrimmed = bytes.length
Review Comment:
How about
```
val strLen = bytes.length
var strEndTrimmed = strLen
```
We keep `strLen` immutable and reuse it later. `strEndTrimmed` used as mutable variable.
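The naming suggestion above can be sketched in isolation (a standalone sketch, not the actual `SparkDateTimeUtils` code; `Character.isWhitespace` stands in for `UTF8String.isWhitespaceOrISOControl`):

```scala
// Sketch: strLen stays an immutable snapshot of the full byte length and can
// be reused later, while strEndTrimmed is the mutable trimmed-end offset.
def parseBounds(bytes: Array[Byte]): (Int, Int) = {
  val strLen = bytes.length      // immutable, reusable
  var strEndTrimmed = strLen     // mutable, excludes trailing whitespace
  while (strEndTrimmed > 0 && Character.isWhitespace(bytes(strEndTrimmed - 1).toChar)) {
    strEndTrimmed -= 1
  }
  (strLen, strEndTrimmed)
}
```

For `"ab  "` this yields `(4, 2)`: the full length is preserved while the trimmed end offset drops the two trailing spaces.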
Re: [PR] [SPARK-46173][SQL] Skipping trimAll call during date parsing [spark]
Posted by "MaxGekk (via GitHub)" <gi...@apache.org>.
MaxGekk commented on PR #44110:
URL: https://github.com/apache/spark/pull/44110#issuecomment-1838402901
> Looking for feedback if we want to keep the benchmark, given that this is a rather esoteric edge case.
I am ok to exclude the benchmark for the pretty specific case from the PR.
Re: [PR] [SPARK-46173] Skipping trimAll call during date parsing [spark]
Posted by "MaxGekk (via GitHub)" <gi...@apache.org>.
MaxGekk commented on PR #44110:
URL: https://github.com/apache/spark/pull/44110#issuecomment-1837229888
> The change also includes a small unit benchmark for this particular case.
I wonder about the other benchmarks. Do you observe any perf regressions? I am asking just in case.
Re: [PR] [SPARK-46173] Skipping trimAll call during date parsing [spark]
Posted by "MaxGekk (via GitHub)" <gi...@apache.org>.
MaxGekk commented on code in PR #44110:
URL: https://github.com/apache/spark/pull/44110#discussion_r1412486753
##########
sql/api/src/main/scala/org/apache/spark/sql/catalyst/util/SparkDateTimeUtils.scala:
##########
@@ -305,21 +305,28 @@ trait SparkDateTimeUtils {
(segment == 0 && digits >= 4 && digits <= maxDigitsYear) ||
(segment != 0 && digits > 0 && digits <= 2)
}
- if (s == null || s.trimAll().numBytes() == 0) {
+ if (s == null) {
return None
}
+
val segments: Array[Int] = Array[Int](1, 1, 1)
var sign = 1
var i = 0
var currentSegmentValue = 0
var currentSegmentDigits = 0
- val bytes = s.trimAll().getBytes
+ val bytes = s.getBytes
var j = 0
- if (bytes(j) == '-' || bytes(j) == '+') {
+
+ while (j < bytes.length && UTF8String.isWhitespaceOrISOControl(bytes(j))) {
+ j += 1;
+ }
+
+ if (j < bytes.length && (bytes(j) == '-' || bytes(j) == '+')) {
sign = if (bytes(j) == '-') -1 else 1
j += 1
}
- while (j < bytes.length && (i < 3 && !(bytes(j) == ' ' || bytes(j) == 'T'))) {
+ while ( j < bytes.length &&
+ (i < 3 && !(UTF8String.isWhitespaceOrISOControl(bytes(j)) || bytes(j) == 'T'))) {
Review Comment:
It would be nice to exclude the `UTF8String.isWhitespaceOrISOControl` check from the main loop. In regular cases, dates don't have any spaces, so we don't need to check every char for being a whitespace. Can we just inline `trimAll` at the beginning? Like:
```scala
var j = 0
while (j < bytes.length && UTF8String.isWhitespaceOrISOControl(bytes(j))) j += 1
var end = bytes.length - 1
while (j < end && UTF8String.isWhitespaceOrISOControl(bytes(end))) end -= 1
if (j >= end) return None
if (bytes(j) == '-' || bytes(j) == '+') {
sign = if (bytes(j) == '-') -1 else 1
j += 1
}
while (j < end && (i < 3 && !(bytes(j) == 'T'))) {
```
Re: [PR] [SPARK-46173][SQL] Skipping trimAll call during date parsing [spark]
Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on code in PR #44110:
URL: https://github.com/apache/spark/pull/44110#discussion_r1413746879
##########
sql/api/src/main/scala/org/apache/spark/sql/catalyst/util/SparkDateTimeUtils.scala:
##########
@@ -305,21 +305,32 @@ trait SparkDateTimeUtils {
(segment == 0 && digits >= 4 && digits <= maxDigitsYear) ||
(segment != 0 && digits > 0 && digits <= 2)
}
- if (s == null || s.trimAll().numBytes() == 0) {
+ if (s == null) {
return None
}
+
val segments: Array[Int] = Array[Int](1, 1, 1)
var sign = 1
var i = 0
var currentSegmentValue = 0
var currentSegmentDigits = 0
- val bytes = s.trimAll().getBytes
+ val bytes = s.getBytes
var j = 0
+
+ while (j < bytes.length && UTF8String.isWhitespaceOrISOControl(bytes(j))) {
+ j += 1;
+ }
+
+ if (j == bytes.length) {
+ return None;
+ }
+
if (bytes(j) == '-' || bytes(j) == '+') {
sign = if (bytes(j) == '-') -1 else 1
j += 1
}
- while (j < bytes.length && (i < 3 && !(bytes(j) == ' ' || bytes(j) == 'T'))) {
+ while (j < bytes.length &&
+ (i < 3 && !(UTF8String.isWhitespaceOrISOControl(bytes(j)) || bytes(j) == 'T'))) {
Review Comment:
Let's add a comment to explain the behavior: spaces in the middle may be allowed.
Re: [PR] [SPARK-46173][SQL] Skipping trimAll call during date parsing [spark]
Posted by "dbatomic (via GitHub)" <gi...@apache.org>.
dbatomic commented on code in PR #44110:
URL: https://github.com/apache/spark/pull/44110#discussion_r1415381396
##########
sql/api/src/main/scala/org/apache/spark/sql/catalyst/util/SparkDateTimeUtils.scala:
##########
@@ -305,21 +305,32 @@ trait SparkDateTimeUtils {
(segment == 0 && digits >= 4 && digits <= maxDigitsYear) ||
(segment != 0 && digits > 0 && digits <= 2)
}
- if (s == null || s.trimAll().numBytes() == 0) {
+ if (s == null) {
return None
}
+
val segments: Array[Int] = Array[Int](1, 1, 1)
var sign = 1
var i = 0
var currentSegmentValue = 0
var currentSegmentDigits = 0
- val bytes = s.trimAll().getBytes
+ val bytes = s.getBytes
var j = 0
+
+ while (j < bytes.length && UTF8String.isWhitespaceOrISOControl(bytes(j))) {
+ j += 1;
+ }
+
+ if (j == bytes.length) {
+ return None;
+ }
+
if (bytes(j) == '-' || bytes(j) == '+') {
sign = if (bytes(j) == '-') -1 else 1
j += 1
}
- while (j < bytes.length && (i < 3 && !(bytes(j) == ' ' || bytes(j) == 'T'))) {
+ while (j < bytes.length &&
+ (i < 3 && !(UTF8String.isWhitespaceOrISOControl(bytes(j)) || bytes(j) == 'T'))) {
Review Comment:
Implemented the approach with end offset. Please take a look.
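The "end offset" approach mentioned here handles trailing whitespace the same allocation-free way: scan backwards once and record an exclusive end index, then use it as the parse loop's bound. A hypothetical Java sketch of that idea (names like `trimmedEnd` are illustrative, not from the PR):

```java
public class EndOffsetSketch {
    // Assumed to mirror UTF8String.isWhitespaceOrISOControl for single bytes.
    static boolean isWhitespaceOrISOControl(byte b) {
        return Character.isWhitespace(b) || Character.isISOControl(b);
    }

    // Exclusive end index after dropping trailing whitespace/control bytes.
    static int trimmedEnd(byte[] bytes) {
        int end = bytes.length;
        while (end > 0 && isWhitespaceOrISOControl(bytes[end - 1])) {
            end--;
        }
        return end;
    }

    public static void main(String[] args) {
        byte[] input = "2023-12-01  \n".getBytes();
        // The parse loop would iterate with `j < trimmedEnd(input)`
        // instead of `j < input.length`, so trailing whitespace is
        // ignored without allocating a trimmed string.
        System.out.println(trimmedEnd(input));
    }
}
```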
Re: [PR] [SPARK-46173][SQL] Skipping trimAll call during date parsing [spark]
Posted by "dbatomic (via GitHub)" <gi...@apache.org>.
dbatomic commented on code in PR #44110:
URL: https://github.com/apache/spark/pull/44110#discussion_r1417456888
##########
sql/api/src/main/scala/org/apache/spark/sql/catalyst/util/SparkDateTimeUtils.scala:
##########
@@ -305,21 +305,35 @@ trait SparkDateTimeUtils {
(segment == 0 && digits >= 4 && digits <= maxDigitsYear) ||
(segment != 0 && digits > 0 && digits <= 2)
}
- if (s == null || s.trimAll().numBytes() == 0) {
+ if (s == null) {
return None
}
+
val segments: Array[Int] = Array[Int](1, 1, 1)
var sign = 1
var i = 0
var currentSegmentValue = 0
var currentSegmentDigits = 0
- val bytes = s.trimAll().getBytes
+ val bytes = s.getBytes
var j = 0
+ var strEndTrimmed = bytes.length
+
+ while (j < bytes.length && UTF8String.isWhitespaceOrISOControl(bytes(j))) {
Review Comment:
so, you are basically suggesting to rename strEndTrimmed back to strLen?
Re: [PR] [SPARK-46173][SQL] Skipping trimAll call during date parsing [spark]
Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on code in PR #44110:
URL: https://github.com/apache/spark/pull/44110#discussion_r1417489483
##########
sql/api/src/main/scala/org/apache/spark/sql/catalyst/util/SparkDateTimeUtils.scala:
##########
@@ -305,21 +305,35 @@ trait SparkDateTimeUtils {
(segment == 0 && digits >= 4 && digits <= maxDigitsYear) ||
(segment != 0 && digits > 0 && digits <= 2)
}
- if (s == null || s.trimAll().numBytes() == 0) {
+ if (s == null) {
return None
}
+
val segments: Array[Int] = Array[Int](1, 1, 1)
var sign = 1
var i = 0
var currentSegmentValue = 0
var currentSegmentDigits = 0
- val bytes = s.trimAll().getBytes
+ val bytes = s.getBytes
var j = 0
+ var strEndTrimmed = bytes.length
+
+ while (j < bytes.length && UTF8String.isWhitespaceOrISOControl(bytes(j))) {
Review Comment:
`strEndTrimmed` is more accurate.