Posted to reviews@spark.apache.org by "dbatomic (via GitHub)" <gi...@apache.org> on 2023/12/01 14:39:35 UTC
[PR] [SPARK-46173] Skipping trimAll call during date parsing [spark]
dbatomic opened a new pull request, #44110:
URL: https://github.com/apache/spark/pull/44110
<!--
Thanks for sending a pull request! Here are some tips for you:
1. If this is your first time, please read our contributor guidelines: https://spark.apache.org/contributing.html
2. Ensure you have added or run the appropriate tests for your PR: https://spark.apache.org/developer-tools.html
3. If the PR is unfinished, add '[WIP]' in your PR title, e.g., '[WIP][SPARK-XXXX] Your PR title ...'.
4. Be sure to keep the PR description updated to reflect all changes.
5. Please write your PR title to summarize what this PR proposes.
6. If possible, provide a concise example to reproduce the issue for a faster review.
7. If you want to add a new configuration, please read the guideline first for naming configurations in
'core/src/main/scala/org/apache/spark/internal/config/ConfigEntry.scala'.
8. If you want to add or modify an error type or message, please read the guideline first in
'core/src/main/resources/error/README.md'.
-->
Link to [JIRA](https://issues.apache.org/jira/browse/SPARK-46173).
### What changes were proposed in this pull request?
<!--
Please clarify what changes you are proposing. The purpose of this section is to outline the changes and how this PR fixes the issue.
If possible, please consider writing useful notes for better and faster reviews in your PR. See the examples below.
1. If you refactor some codes with changing classes, showing the class hierarchy will help reviewers.
2. If you fix some SQL features, you can provide some references of other DBMSes.
3. If there is design documentation, please add the link.
4. If there is a discussion in the mailing list, please add the link.
-->
This PR is a response to a user complaint that string-to-date casts are too slow when the input string has many whitespace/ISO control characters as a prefix or suffix.
The current behaviour is that the StringToDate function first calls trimAll to remove any whitespace/ISO control characters. trimAll creates a new string, and we then work against that string. This is not really needed, since StringToDate can simply skip these characters and do all the processing in a single loop without extra allocations. The proposed fix does exactly that.
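The single-pass idea described above can be sketched as follows. This is a simplified, hypothetical stand-alone version for illustration only; the real implementation in SparkDateTimeUtils.stringToDate operates on UTF8String bytes, and the class and method names here are made up:

```java
// Simplified sketch of the single-pass approach: skip leading/trailing
// whitespace and ISO control characters by moving indices, instead of
// allocating a trimmed copy of the string up front.
public class SkipTrimSketch {
    static boolean skippable(char c) {
        return Character.isWhitespace(c) || Character.isISOControl(c);
    }

    /** Returns {start, end} bounds of the trimmed content, or null if none. */
    static int[] skipTrim(String s) {
        if (s == null) return null;
        int start = 0;
        while (start < s.length() && skippable(s.charAt(start))) start++;
        int end = s.length();
        while (end > start && skippable(s.charAt(end - 1))) end--;
        // Null, empty, or all-whitespace input yields null, matching the old
        // `s.trimAll().numBytes() == 0` check without the extra allocation.
        return (start >= end) ? null : new int[] {start, end};
    }
}
```

The parser can then read characters in the half-open range [start, end) directly, so no intermediate string is ever created.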
### Why are the changes needed?
<!--
Please clarify why the changes are needed. For instance,
1. If you propose a new API, clarify the use case for a new API.
2. If you fix a bug, you can clarify why it is a bug.
-->
These changes should drastically improve the edge case where the input string in a cast to date has many whitespace characters as a prefix/suffix.
### Does this PR introduce _any_ user-facing change?
<!--
Note that it means *any* user-facing change including all aspects such as the documentation fix.
If yes, please clarify the previous behavior and the change this PR proposes - provide the console output, description and/or an example to show the behavior difference if possible.
If possible, please also clarify if this is a user-facing change compared to the released Spark versions or within the unreleased branches such as master.
If no, write 'No'.
-->
No
### How was this patch tested?
<!--
If tests were added, say they were added here. Please make sure to add some test cases that check the changes thoroughly including negative and positive cases if possible.
If it was tested in a way different from regular unit tests, please clarify how you tested step by step, ideally copy and paste-able, so that other reviewers can test and check, and descendants can verify in the future.
If tests were not added, please describe why they were not added and/or why it was difficult to add.
If benchmark tests were added, please run the benchmarks in GitHub Actions for the consistent environment, and the instructions could accord to: https://spark.apache.org/developer-tools.html#github-workflow-benchmarks.
-->
Extended existing tests to cover both prefix and suffix cases with whitespace/control characters.
The change also includes a small unit benchmark for this particular case. Looking for feedback on whether we want to keep the benchmark, given that this is a rather esoteric edge case.
The benchmark measures the time needed to do 10k calls of the stringToDate function against input strings of the form <65k whitespace><correct_date><65k whitespace> and <correct_date><65k whitespace>. Here are the results:
| input example | prior to change | after the change |
| -------------- | ------------------ | ------------- |
| 65k whitespace prefix + 65k suffix | 3250ms | 484ms |
| 65k whitespace suffix | 1572ms | <1ms |
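A benchmark of this shape can be approximated with a trivial timing loop. The sketch below is illustrative only: the numbers in the table above come from the PR's own benchmark, and the padding size and iteration count here are assumptions:

```java
// Hypothetical micro-benchmark contrasting trim-then-parse (allocates a
// trimmed copy per call) with an index-based skip (no allocation).
public class TrimBench {
    static long timeMs(int iters, Runnable body) {
        long start = System.nanoTime();
        for (int i = 0; i < iters; i++) body.run();
        return (System.nanoTime() - start) / 1_000_000L;
    }

    public static void main(String[] args) {
        String pad = " ".repeat(65536);
        String padded = pad + "2015-03-18" + pad;
        // Each trim() call scans the padding and copies out a new String.
        long trimCopy = timeMs(1000, () -> padded.trim());
        // The skip only advances an index over the same prefix.
        long skip = timeMs(1000, () -> {
            int j = 0;
            while (j < padded.length() && Character.isWhitespace(padded.charAt(j))) j++;
        });
        System.out.println("trim copy: " + trimCopy + " ms, index skip: " + skip + " ms");
    }
}
```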
### Was this patch authored or co-authored using generative AI tooling?
<!--
If generative AI tooling has been used in the process of authoring this patch, please include the
phrase: 'Generated-by: ' followed by the name of the tool and its version.
If no, write 'No'.
Please refer to the [ASF Generative Tooling Guidance](https://www.apache.org/legal/generative-tooling.html) for details.
-->
No
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
Re: [PR] [SPARK-46173][SQL] Skipping trimAll call during date parsing [spark]
Posted by "beliefer (via GitHub)" <gi...@apache.org>.
beliefer commented on code in PR #44110:
URL: https://github.com/apache/spark/pull/44110#discussion_r1413764128
##########
sql/api/src/main/scala/org/apache/spark/sql/catalyst/util/SparkDateTimeUtils.scala:
##########
@@ -305,21 +305,32 @@ trait SparkDateTimeUtils {
(segment == 0 && digits >= 4 && digits <= maxDigitsYear) ||
(segment != 0 && digits > 0 && digits <= 2)
}
- if (s == null || s.trimAll().numBytes() == 0) {
+ if (s == null) {
return None
}
+
val segments: Array[Int] = Array[Int](1, 1, 1)
var sign = 1
var i = 0
var currentSegmentValue = 0
var currentSegmentDigits = 0
- val bytes = s.trimAll().getBytes
+ val bytes = s.getBytes
var j = 0
+
+ while (j < bytes.length && UTF8String.isWhitespaceOrISOControl(bytes(j))) {
+ j += 1;
+ }
Review Comment:
Please ignore the above comment.
Re: [PR] [SPARK-46173][SQL] Skipping trimAll call during date parsing [spark]
Posted by "dbatomic (via GitHub)" <gi...@apache.org>.
dbatomic commented on PR #44110:
URL: https://github.com/apache/spark/pull/44110#issuecomment-1845566488
> @dbatomic Does this suffer from the same issue?
>
> https://github.com/apache/spark/blob/b4d90dd2a622beda542a7ce1ee15af9f312f9724/sql/api/src/main/scala/org/apache/spark/sql/catalyst/util/SparkDateTimeUtils.scala#L429
>
> If so, could you optimize it too.
Sure, good catch. I will follow up with a PR that takes care of this one.
Re: [PR] [SPARK-46173][SQL] Skipping trimAll call during date parsing [spark]
Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan closed pull request #44110: [SPARK-46173][SQL] Skipping trimAll call during date parsing
URL: https://github.com/apache/spark/pull/44110
Re: [PR] [SPARK-46173][SQL] Skipping trimAll call during date parsing [spark]
Posted by "beliefer (via GitHub)" <gi...@apache.org>.
beliefer commented on code in PR #44110:
URL: https://github.com/apache/spark/pull/44110#discussion_r1418184528
##########
sql/api/src/main/scala/org/apache/spark/sql/catalyst/util/SparkDateTimeUtils.scala:
##########
@@ -305,21 +305,35 @@ trait SparkDateTimeUtils {
(segment == 0 && digits >= 4 && digits <= maxDigitsYear) ||
(segment != 0 && digits > 0 && digits <= 2)
}
- if (s == null || s.trimAll().numBytes() == 0) {
+ if (s == null) {
return None
}
+
val segments: Array[Int] = Array[Int](1, 1, 1)
var sign = 1
var i = 0
var currentSegmentValue = 0
var currentSegmentDigits = 0
- val bytes = s.trimAll().getBytes
+ val bytes = s.getBytes
var j = 0
+ var strEndTrimmed = bytes.length
+
+ while (j < bytes.length && UTF8String.isWhitespaceOrISOControl(bytes(j))) {
Review Comment:
No. My original suggestion was to create two variables, `strLen` and `strEndTrimmed`. The suggestion is unrelated to renaming `strEndTrimmed`. Of course, we can decide to use `bytes.length` directly instead of creating `strLen`.
Re: [PR] [SPARK-46173][SQL] Skipping trimAll call during date parsing [spark]
Posted by "beliefer (via GitHub)" <gi...@apache.org>.
beliefer commented on code in PR #44110:
URL: https://github.com/apache/spark/pull/44110#discussion_r1415543362
##########
sql/api/src/main/scala/org/apache/spark/sql/catalyst/util/SparkDateTimeUtils.scala:
##########
@@ -305,21 +305,41 @@ trait SparkDateTimeUtils {
(segment == 0 && digits >= 4 && digits <= maxDigitsYear) ||
(segment != 0 && digits > 0 && digits <= 2)
}
- if (s == null || s.trimAll().numBytes() == 0) {
+ if (s == null) {
return None
}
+
val segments: Array[Int] = Array[Int](1, 1, 1)
var sign = 1
var i = 0
var currentSegmentValue = 0
var currentSegmentDigits = 0
- val bytes = s.trimAll().getBytes
+ val bytes = s.getBytes
var j = 0
+
+ while (j < bytes.length && UTF8String.isWhitespaceOrISOControl(bytes(j))) {
+ j += 1;
+ }
+
+ if (j == bytes.length) {
+ return None;
+ }
+
+ var strlen = bytes.length
+ while (strlen > j && UTF8String.isWhitespaceOrISOControl(bytes(strlen - 1))) {
+ strlen -= 1;
+ }
+
+ if (j >= strlen)
+ {
+ return None
Review Comment:
Is this path reachable?
Re: [PR] [SPARK-46173][SQL] Skipping trimAll call during date parsing [spark]
Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on code in PR #44110:
URL: https://github.com/apache/spark/pull/44110#discussion_r1413748213
##########
sql/api/src/main/scala/org/apache/spark/sql/catalyst/util/SparkDateTimeUtils.scala:
##########
@@ -305,21 +305,32 @@ trait SparkDateTimeUtils {
(segment == 0 && digits >= 4 && digits <= maxDigitsYear) ||
(segment != 0 && digits > 0 && digits <= 2)
}
- if (s == null || s.trimAll().numBytes() == 0) {
+ if (s == null) {
return None
}
+
val segments: Array[Int] = Array[Int](1, 1, 1)
var sign = 1
var i = 0
var currentSegmentValue = 0
var currentSegmentDigits = 0
- val bytes = s.trimAll().getBytes
+ val bytes = s.getBytes
var j = 0
+
+ while (j < bytes.length && UTF8String.isWhitespaceOrISOControl(bytes(j))) {
+ j += 1;
+ }
+
+ if (j == bytes.length) {
+ return None;
+ }
+
if (bytes(j) == '-' || bytes(j) == '+') {
sign = if (bytes(j) == '-') -1 else 1
j += 1
}
- while (j < bytes.length && (i < 3 && !(bytes(j) == ' ' || bytes(j) == 'T'))) {
+ while (j < bytes.length &&
+ (i < 3 && !(UTF8String.isWhitespaceOrISOControl(bytes(j)) || bytes(j) == 'T'))) {
Review Comment:
oh is this a behavior change? now not only spaces, but also ISO controls are allowed in the middle.
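The behavior difference being flagged can be made concrete with a sketch of the two separator predicates. These are hypothetical helpers mirroring the old and new loop conditions, not the actual Spark code:

```java
// Sketch of the old vs. new conditions that end the date part of the string.
public class SeparatorChange {
    // Old condition: the date part ends only at ' ' or 'T'.
    static boolean isSeparatorOld(byte b) {
        return b == ' ' || b == 'T';
    }

    // New condition: any whitespace or ISO control character ends it too.
    static boolean isSeparatorNew(byte b) {
        return Character.isWhitespace(b) || Character.isISOControl(b) || b == 'T';
    }
}
```

Under this sketch, a control byte such as 0x03 (as in "2015-03-18\u0003...") is not a separator under the old rule but is under the new one, which is exactly the behavior change being discussed.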
Re: [PR] [SPARK-46173][SQL] Skipping trimAll call during date parsing [spark]
Posted by "dbatomic (via GitHub)" <gi...@apache.org>.
dbatomic commented on code in PR #44110:
URL: https://github.com/apache/spark/pull/44110#discussion_r1416987444
##########
sql/api/src/main/scala/org/apache/spark/sql/catalyst/util/SparkDateTimeUtils.scala:
##########
@@ -305,21 +305,35 @@ trait SparkDateTimeUtils {
(segment == 0 && digits >= 4 && digits <= maxDigitsYear) ||
(segment != 0 && digits > 0 && digits <= 2)
}
- if (s == null || s.trimAll().numBytes() == 0) {
+ if (s == null) {
return None
}
+
val segments: Array[Int] = Array[Int](1, 1, 1)
var sign = 1
var i = 0
var currentSegmentValue = 0
var currentSegmentDigits = 0
- val bytes = s.trimAll().getBytes
+ val bytes = s.getBytes
var j = 0
+ var strEndTrimmed = bytes.length
Review Comment:
Yeah, I also don't think that we need two variables. I can call it strLen or strLenTrimmed or anything along those lines, if that sounds better to you guys.
Re: [PR] [SPARK-46173][SQL] Skipping trimAll call during date parsing [spark]
Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on code in PR #44110:
URL: https://github.com/apache/spark/pull/44110#discussion_r1416657435
##########
sql/api/src/main/scala/org/apache/spark/sql/catalyst/util/SparkDateTimeUtils.scala:
##########
@@ -305,21 +305,35 @@ trait SparkDateTimeUtils {
(segment == 0 && digits >= 4 && digits <= maxDigitsYear) ||
(segment != 0 && digits > 0 && digits <= 2)
}
- if (s == null || s.trimAll().numBytes() == 0) {
+ if (s == null) {
return None
}
+
val segments: Array[Int] = Array[Int](1, 1, 1)
var sign = 1
var i = 0
var currentSegmentValue = 0
var currentSegmentDigits = 0
- val bytes = s.trimAll().getBytes
+ val bytes = s.getBytes
var j = 0
+ var strEndTrimmed = bytes.length
Review Comment:
`bytes.length` is very cheap and clear. It doesn't seem helpful to create a variable for it.
Re: [PR] [SPARK-46173][SQL] Skipping trimAll call during date parsing [spark]
Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on code in PR #44110:
URL: https://github.com/apache/spark/pull/44110#discussion_r1416657240
##########
sql/api/src/main/scala/org/apache/spark/sql/catalyst/util/SparkDateTimeUtils.scala:
##########
@@ -305,21 +305,35 @@ trait SparkDateTimeUtils {
(segment == 0 && digits >= 4 && digits <= maxDigitsYear) ||
(segment != 0 && digits > 0 && digits <= 2)
}
- if (s == null || s.trimAll().numBytes() == 0) {
+ if (s == null) {
return None
}
+
val segments: Array[Int] = Array[Int](1, 1, 1)
var sign = 1
var i = 0
var currentSegmentValue = 0
var currentSegmentDigits = 0
- val bytes = s.trimAll().getBytes
+ val bytes = s.getBytes
var j = 0
+ var strEndTrimmed = bytes.length
+
+ while (j < bytes.length && UTF8String.isWhitespaceOrISOControl(bytes(j))) {
Review Comment:
how?
Re: [PR] [SPARK-46173] Skipping trimAll call during date parsing [spark]
Posted by "MaxGekk (via GitHub)" <gi...@apache.org>.
MaxGekk commented on PR #44110:
URL: https://github.com/apache/spark/pull/44110#issuecomment-1836615208
@dbatomic Could you remove this Link to [JIRA](https://issues.apache.org/jira/browse/SPARK-46173) from the PR's description (the PR's title has it already), and add the `[SQL]` tag to the title.
Re: [PR] [SPARK-46173][SQL] Skipping trimAll call during date parsing [spark]
Posted by "beliefer (via GitHub)" <gi...@apache.org>.
beliefer commented on code in PR #44110:
URL: https://github.com/apache/spark/pull/44110#discussion_r1413737755
##########
sql/api/src/main/scala/org/apache/spark/sql/catalyst/util/SparkDateTimeUtils.scala:
##########
@@ -305,21 +305,32 @@ trait SparkDateTimeUtils {
(segment == 0 && digits >= 4 && digits <= maxDigitsYear) ||
(segment != 0 && digits > 0 && digits <= 2)
}
- if (s == null || s.trimAll().numBytes() == 0) {
+ if (s == null) {
return None
}
+
val segments: Array[Int] = Array[Int](1, 1, 1)
var sign = 1
var i = 0
var currentSegmentValue = 0
var currentSegmentDigits = 0
- val bytes = s.trimAll().getBytes
Review Comment:
We can reuse `trimLeft` here.
##########
sql/api/src/main/scala/org/apache/spark/sql/catalyst/util/SparkDateTimeUtils.scala:
##########
@@ -305,21 +305,32 @@ trait SparkDateTimeUtils {
(segment == 0 && digits >= 4 && digits <= maxDigitsYear) ||
(segment != 0 && digits > 0 && digits <= 2)
}
- if (s == null || s.trimAll().numBytes() == 0) {
+ if (s == null) {
return None
}
+
val segments: Array[Int] = Array[Int](1, 1, 1)
var sign = 1
var i = 0
var currentSegmentValue = 0
var currentSegmentDigits = 0
- val bytes = s.trimAll().getBytes
+ val bytes = s.getBytes
var j = 0
+
+ while (j < bytes.length && UTF8String.isWhitespaceOrISOControl(bytes(j))) {
+ j += 1;
+ }
Review Comment:
After using `trimLeft`, we can delete these three lines.
Re: [PR] [SPARK-46173][SQL] Skipping trimAll call during date parsing [spark]
Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on code in PR #44110:
URL: https://github.com/apache/spark/pull/44110#discussion_r1414260156
##########
sql/api/src/main/scala/org/apache/spark/sql/catalyst/util/SparkDateTimeUtils.scala:
##########
@@ -305,21 +305,32 @@ trait SparkDateTimeUtils {
(segment == 0 && digits >= 4 && digits <= maxDigitsYear) ||
(segment != 0 && digits > 0 && digits <= 2)
}
- if (s == null || s.trimAll().numBytes() == 0) {
+ if (s == null) {
return None
}
+
val segments: Array[Int] = Array[Int](1, 1, 1)
var sign = 1
var i = 0
var currentSegmentValue = 0
var currentSegmentDigits = 0
- val bytes = s.trimAll().getBytes
+ val bytes = s.getBytes
var j = 0
+
+ while (j < bytes.length && UTF8String.isWhitespaceOrISOControl(bytes(j))) {
+ j += 1;
+ }
+
+ if (j == bytes.length) {
+ return None;
+ }
+
if (bytes(j) == '-' || bytes(j) == '+') {
sign = if (bytes(j) == '-') -1 else 1
j += 1
}
- while (j < bytes.length && (i < 3 && !(bytes(j) == ' ' || bytes(j) == 'T'))) {
+ while (j < bytes.length &&
+ (i < 3 && !(UTF8String.isWhitespaceOrISOControl(bytes(j)) || bytes(j) == 'T'))) {
Review Comment:
Let's not change any behavior but only improve the perf. i.e. only allow ' ' and 'T' in the middle.
Re: [PR] [SPARK-46173] Skipping trimAll call during date parsing [spark]
Posted by "beliefer (via GitHub)" <gi...@apache.org>.
beliefer commented on code in PR #44110:
URL: https://github.com/apache/spark/pull/44110#discussion_r1412705078
##########
common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java:
##########
@@ -502,7 +502,7 @@ private UTF8String copyUTF8String(int start, int end) {
* Determines if the specified character (Unicode code point) is white space or an ISO control
* character according to Java.
*/
- private boolean isWhitespaceOrISOControl(int codePoint) {
+ public static boolean isWhitespaceOrISOControl(int codePoint) {
Review Comment:
Because the visibility changed, please move `isWhitespaceOrISOControl` above the constructor.
Re: [PR] [SPARK-46173][SQL] Skipping trimAll call during date parsing [spark]
Posted by "dbatomic (via GitHub)" <gi...@apache.org>.
dbatomic commented on PR #44110:
URL: https://github.com/apache/spark/pull/44110#issuecomment-1838267647
> > The change also includes a small unit benchmark for this particular case.
>
> I wonder of other benchmarks. Do you observe perf regressions? I am asking just in case.
I just ran Parsing benchmarks locally. Here are the results:
With trimming changes:
func_name: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
to timestamp str wholestage on 85 87 2 11.7 85.2 1.1X
to_timestamp wholestage on 421 423 3 2.4 420.7 1.0X
to_unix_timestamp wholestage on 418 423 4 2.4 417.9 1.0X
to date str wholestage on 98 101 1 10.2 98.3 0.9X
to_date wholestage on 612 616 4 1.6 611.8 1.0X
to_date_with_trimming 758 763 7 0.0 75792262.5 1.0X
to_date_with_trimming_only_suffix 0 1 0 0.0 44237.5 1713.3X
--
Without my change
func_name: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
to timestamp str wholestage on 82 83 1 12.2 82.1 1.2X
to_timestamp wholestage on 418 421 2 2.4 418.1 0.9X
to_unix_timestamp wholestage on 414 416 2 2.4 413.9 0.9X
to date str wholestage on 101 103 2 9.9 101.4 0.9X
to_date wholestage on 620 624 4 1.6 620.3 1.0X
TLDR - no observable change. On some functions it is <1% worse and elsewhere it is 1% better, but again, these are just runs on my local machine without any real isolation, so a tiny diff is expected.
Re: [PR] [SPARK-46173][SQL] Skipping trimAll call during date parsing [spark]
Posted by "dbatomic (via GitHub)" <gi...@apache.org>.
dbatomic commented on code in PR #44110:
URL: https://github.com/apache/spark/pull/44110#discussion_r1413761230
##########
sql/api/src/main/scala/org/apache/spark/sql/catalyst/util/SparkDateTimeUtils.scala:
##########
@@ -305,21 +305,32 @@ trait SparkDateTimeUtils {
(segment == 0 && digits >= 4 && digits <= maxDigitsYear) ||
(segment != 0 && digits > 0 && digits <= 2)
}
- if (s == null || s.trimAll().numBytes() == 0) {
+ if (s == null) {
return None
}
+
val segments: Array[Int] = Array[Int](1, 1, 1)
var sign = 1
var i = 0
var currentSegmentValue = 0
var currentSegmentDigits = 0
- val bytes = s.trimAll().getBytes
+ val bytes = s.getBytes
var j = 0
+
+ while (j < bytes.length && UTF8String.isWhitespaceOrISOControl(bytes(j))) {
+ j += 1;
+ }
+
+ if (j == bytes.length) {
+ return None;
+ }
+
if (bytes(j) == '-' || bytes(j) == '+') {
sign = if (bytes(j) == '-') -1 else 1
j += 1
}
- while (j < bytes.length && (i < 3 && !(bytes(j) == ' ' || bytes(j) == 'T'))) {
+ while (j < bytes.length &&
+ (i < 3 && !(UTF8String.isWhitespaceOrISOControl(bytes(j)) || bytes(j) == 'T'))) {
Review Comment:
That's right. E.g. "2015-03-18\u0003123142" will be correctly parsed with this change.
I feel like we need to identify the proper behaviour here, since the current situation is pretty chaotic.
As of this change we have the following rules:
1) The string can have any number of whitespace or ISO control characters at the beginning or end. All of these will be ignored.
2) The date and time parts can be split by either a whitespace, an ISO control character, or the 'T' char.
If we feel this is not ok (date and time should be separated only by ' ' and 'T'), I can think of something to enforce that.
Re: [PR] [SPARK-46173] Skipping trimAll call during date parsing [spark]
Posted by "dbatomic (via GitHub)" <gi...@apache.org>.
dbatomic commented on PR #44110:
URL: https://github.com/apache/spark/pull/44110#issuecomment-1836239754
@MaxGekk -> Looking for your feedback on this one.
Re: [PR] [SPARK-46173][SQL] Skipping trimAll call during date parsing [spark]
Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on PR #44110:
URL: https://github.com/apache/spark/pull/44110#issuecomment-1843090919
The docker integration test failure is unrelated. merging to master, thanks!
Re: [PR] [SPARK-46173][SQL] Skipping trimAll call during date parsing [spark]
Posted by "MaxGekk (via GitHub)" <gi...@apache.org>.
MaxGekk commented on PR #44110:
URL: https://github.com/apache/spark/pull/44110#issuecomment-1845496117
@dbatomic Does this suffer from the same issue?
https://github.com/apache/spark/blob/b4d90dd2a622beda542a7ce1ee15af9f312f9724/sql/api/src/main/scala/org/apache/spark/sql/catalyst/util/SparkDateTimeUtils.scala#L429
If so, could you optimize it too.
Re: [PR] [SPARK-46173][SQL] Skipping trimAll call during date parsing [spark]
Posted by "MaxGekk (via GitHub)" <gi...@apache.org>.
MaxGekk commented on code in PR #44110:
URL: https://github.com/apache/spark/pull/44110#discussion_r1413582904
##########
sql/api/src/main/scala/org/apache/spark/sql/catalyst/util/SparkDateTimeUtils.scala:
##########
@@ -305,21 +305,28 @@ trait SparkDateTimeUtils {
(segment == 0 && digits >= 4 && digits <= maxDigitsYear) ||
(segment != 0 && digits > 0 && digits <= 2)
}
- if (s == null || s.trimAll().numBytes() == 0) {
+ if (s == null) {
return None
}
+
val segments: Array[Int] = Array[Int](1, 1, 1)
var sign = 1
var i = 0
var currentSegmentValue = 0
var currentSegmentDigits = 0
- val bytes = s.trimAll().getBytes
+ val bytes = s.getBytes
var j = 0
- if (bytes(j) == '-' || bytes(j) == '+') {
+
+ while (j < bytes.length && UTF8String.isWhitespaceOrISOControl(bytes(j))) {
+ j += 1;
+ }
+
+ if (j < bytes.length && (bytes(j) == '-' || bytes(j) == '+')) {
sign = if (bytes(j) == '-') -1 else 1
j += 1
}
- while (j < bytes.length && (i < 3 && !(bytes(j) == ' ' || bytes(j) == 'T'))) {
+ while ( j < bytes.length &&
Review Comment:
Just in case, here is the style guide: https://github.com/databricks/scala-style-guide
Re: [PR] [SPARK-46173][SQL] Skipping trimAll call during date parsing [spark]
Posted by "dbatomic (via GitHub)" <gi...@apache.org>.
dbatomic commented on PR #44110:
URL: https://github.com/apache/spark/pull/44110#issuecomment-1838423819
> > Looking for feedback if we want to keep the benchmark, given that this is a rather esoteric edge case.
>
> I am ok to exclude the benchmark for the pretty specific case from the PR.
I decided to exclude the benchmark - my thinking is that there is no point in having a benchmark for such an edge case.
Re: [PR] [SPARK-46173][SQL] Skipping trimAll call during date parsing [spark]
Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on code in PR #44110:
URL: https://github.com/apache/spark/pull/44110#discussion_r1413750468
##########
sql/api/src/main/scala/org/apache/spark/sql/catalyst/util/SparkDateTimeUtils.scala:
##########
@@ -305,21 +305,32 @@ trait SparkDateTimeUtils {
(segment == 0 && digits >= 4 && digits <= maxDigitsYear) ||
(segment != 0 && digits > 0 && digits <= 2)
}
- if (s == null || s.trimAll().numBytes() == 0) {
+ if (s == null) {
return None
}
+
val segments: Array[Int] = Array[Int](1, 1, 1)
var sign = 1
var i = 0
var currentSegmentValue = 0
var currentSegmentDigits = 0
- val bytes = s.trimAll().getBytes
+ val bytes = s.getBytes
var j = 0
+
+ while (j < bytes.length && UTF8String.isWhitespaceOrISOControl(bytes(j))) {
+ j += 1;
+ }
+
+ if (j == bytes.length) {
+ return None;
+ }
+
if (bytes(j) == '-' || bytes(j) == '+') {
sign = if (bytes(j) == '-') -1 else 1
j += 1
}
- while (j < bytes.length && (i < 3 && !(bytes(j) == ' ' || bytes(j) == 'T'))) {
+ while (j < bytes.length &&
+ (i < 3 && !(UTF8String.isWhitespaceOrISOControl(bytes(j)) || bytes(j) == 'T'))) {
Review Comment:
Maybe we should calculate the end offset (not always `bytes.length`) at the beginning, by excluding white spaces and ISO controls from the end.
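This end-offset idea can be sketched as follows. It is a minimal standalone sketch, not Spark's actual code, and `isWsOrCtrl` is a simplified stand-in for `UTF8String.isWhitespaceOrISOControl`:

```scala
// Simplified stand-in for UTF8String.isWhitespaceOrISOControl.
def isWsOrCtrl(b: Byte): Boolean =
  Character.isWhitespace(b.toChar) || Character.isISOControl(b.toChar)

// Returns (start, end) so that indices start until end cover the trimmed
// view, computed with index arithmetic only - no copy of the input is made.
def trimmedRange(bytes: Array[Byte]): (Int, Int) = {
  var start = 0
  while (start < bytes.length && isWsOrCtrl(bytes(start))) start += 1
  var end = bytes.length
  while (end > start && isWsOrCtrl(bytes(end - 1))) end -= 1
  (start, end)
}
```

With both offsets computed once up front, all later loops can be bounded by `end` instead of `bytes.length`, which is the point of the suggestion: the whitespace check leaves the hot loop entirely.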
Re: [PR] [SPARK-46173][SQL] Skipping trimAll call during date parsing [spark]
Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on code in PR #44110:
URL: https://github.com/apache/spark/pull/44110#discussion_r1413744707
##########
sql/api/src/main/scala/org/apache/spark/sql/catalyst/util/SparkDateTimeUtils.scala:
##########
@@ -305,21 +305,32 @@ trait SparkDateTimeUtils {
(segment == 0 && digits >= 4 && digits <= maxDigitsYear) ||
(segment != 0 && digits > 0 && digits <= 2)
}
- if (s == null || s.trimAll().numBytes() == 0) {
+ if (s == null) {
return None
}
+
val segments: Array[Int] = Array[Int](1, 1, 1)
var sign = 1
var i = 0
var currentSegmentValue = 0
var currentSegmentDigits = 0
- val bytes = s.trimAll().getBytes
Review Comment:
According to the PR description, we don't want to use these trim methods to avoid string copy.
Re: [PR] [SPARK-46173][SQL] Skipping trimAll call during date parsing [spark]
Posted by "MaxGekk (via GitHub)" <gi...@apache.org>.
MaxGekk commented on code in PR #44110:
URL: https://github.com/apache/spark/pull/44110#discussion_r1413574114
##########
sql/api/src/main/scala/org/apache/spark/sql/catalyst/util/SparkDateTimeUtils.scala:
##########
@@ -305,21 +305,28 @@ trait SparkDateTimeUtils {
(segment == 0 && digits >= 4 && digits <= maxDigitsYear) ||
(segment != 0 && digits > 0 && digits <= 2)
}
- if (s == null || s.trimAll().numBytes() == 0) {
+ if (s == null) {
return None
}
+
val segments: Array[Int] = Array[Int](1, 1, 1)
var sign = 1
var i = 0
var currentSegmentValue = 0
var currentSegmentDigits = 0
- val bytes = s.trimAll().getBytes
+ val bytes = s.getBytes
var j = 0
- if (bytes(j) == '-' || bytes(j) == '+') {
+
+ while (j < bytes.length && UTF8String.isWhitespaceOrISOControl(bytes(j))) {
+ j += 1;
+ }
+
+ if (j < bytes.length && (bytes(j) == '-' || bytes(j) == '+')) {
sign = if (bytes(j) == '-') -1 else 1
j += 1
}
- while (j < bytes.length && (i < 3 && !(bytes(j) == ' ' || bytes(j) == 'T'))) {
+ while ( j < bytes.length &&
+ (i < 3 && !(UTF8String.isWhitespaceOrISOControl(bytes(j)) || bytes(j) == 'T'))) {
Review Comment:
ok, I see. Let's preserve the behaviour.
Re: [PR] [SPARK-46173][SQL] Skipping trimAll call during date parsing [spark]
Posted by "dbatomic (via GitHub)" <gi...@apache.org>.
dbatomic commented on code in PR #44110:
URL: https://github.com/apache/spark/pull/44110#discussion_r1413561777
##########
sql/api/src/main/scala/org/apache/spark/sql/catalyst/util/SparkDateTimeUtils.scala:
##########
@@ -305,21 +305,28 @@ trait SparkDateTimeUtils {
(segment == 0 && digits >= 4 && digits <= maxDigitsYear) ||
(segment != 0 && digits > 0 && digits <= 2)
}
- if (s == null || s.trimAll().numBytes() == 0) {
+ if (s == null) {
return None
}
+
val segments: Array[Int] = Array[Int](1, 1, 1)
var sign = 1
var i = 0
var currentSegmentValue = 0
var currentSegmentDigits = 0
- val bytes = s.trimAll().getBytes
+ val bytes = s.getBytes
var j = 0
- if (bytes(j) == '-' || bytes(j) == '+') {
+
+ while (j < bytes.length && UTF8String.isWhitespaceOrISOControl(bytes(j))) {
+ j += 1;
+ }
+
+ if (j < bytes.length && (bytes(j) == '-' || bytes(j) == '+')) {
sign = if (bytes(j) == '-') -1 else 1
j += 1
}
- while (j < bytes.length && (i < 3 && !(bytes(j) == ' ' || bytes(j) == 'T'))) {
+ while ( j < bytes.length &&
+ (i < 3 && !(UTF8String.isWhitespaceOrISOControl(bytes(j)) || bytes(j) == 'T'))) {
Review Comment:
I had a conversation on this with @cloud-fan last week. The thing is that, for example, the following string will be parsed just fine: `toDate("2015-01-28 someRandomStuffAtTheEnd")`. I proposed to change this behaviour, but @cloud-fan pushed back and noted that it is too late to change this functionality and that this bug is more of a feature at this point :)
Hence, we can't just trim trailing whitespace and call it done, since there might be whitespace in the middle.
Also @cloud-fan , if we are sure that we want to keep this behaviour I would like to at least add a test that verifies that `toDate("<correct_date> <some gibberish>")` will be correctly parsed.
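The behaviour under discussion can be illustrated with a small self-contained sketch. This is not Spark's implementation - it only mimics the shape of the segment loop in `SparkDateTimeUtils` to show why a full date followed by arbitrary text still parses: the loop stops at the first space or `'T'` and never inspects the remainder.

```scala
// Minimal sketch (not Spark's actual code) of the date segment loop.
// Returns Some((year, month, day)) when the prefix is a well-formed date.
def sketchToDate(s: String): Option[(Int, Int, Int)] = {
  val bytes = s.getBytes
  var j = 0
  // Skip leading whitespace, as the PR does with an index instead of trimAll.
  while (j < bytes.length && Character.isWhitespace(bytes(j).toChar)) j += 1
  val segments = Array(1, 1, 1)
  var i = 0
  var value = 0
  var digits = 0
  // The loop exits at the first ' ' or 'T', ignoring everything after it.
  while (j < bytes.length && i < 3 && bytes(j) != ' ' && bytes(j) != 'T') {
    val b = bytes(j)
    if (b == '-') { segments(i) = value; value = 0; digits = 0; i += 1 }
    else if (b >= '0' && b <= '9') { value = value * 10 + (b - '0'); digits += 1 }
    else return None
    j += 1
  }
  if (digits == 0) return None
  segments(i) = value
  Some((segments(0), segments(1), segments(2)))
}
```

A trailing gibberish suffix is therefore accepted, e.g. `sketchToDate("2015-01-28 someRandomStuffAtTheEnd")` yields `Some((2015, 1, 28))`, which is exactly the behaviour a test should pin down.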
Re: [PR] [SPARK-46173][SQL] Skipping trimAll call during date parsing [spark]
Posted by "dbatomic (via GitHub)" <gi...@apache.org>.
dbatomic commented on code in PR #44110:
URL: https://github.com/apache/spark/pull/44110#discussion_r1413654244
##########
sql/api/src/main/scala/org/apache/spark/sql/catalyst/util/SparkDateTimeUtils.scala:
##########
@@ -305,21 +305,28 @@ trait SparkDateTimeUtils {
(segment == 0 && digits >= 4 && digits <= maxDigitsYear) ||
(segment != 0 && digits > 0 && digits <= 2)
}
- if (s == null || s.trimAll().numBytes() == 0) {
Review Comment:
done
Re: [PR] [SPARK-46173][SQL] Skipping trimAll call during date parsing [spark]
Posted by "beliefer (via GitHub)" <gi...@apache.org>.
beliefer commented on code in PR #44110:
URL: https://github.com/apache/spark/pull/44110#discussion_r1417100176
##########
sql/api/src/main/scala/org/apache/spark/sql/catalyst/util/SparkDateTimeUtils.scala:
##########
@@ -305,21 +305,35 @@ trait SparkDateTimeUtils {
(segment == 0 && digits >= 4 && digits <= maxDigitsYear) ||
(segment != 0 && digits > 0 && digits <= 2)
}
- if (s == null || s.trimAll().numBytes() == 0) {
+ if (s == null) {
return None
}
+
val segments: Array[Int] = Array[Int](1, 1, 1)
var sign = 1
var i = 0
var currentSegmentValue = 0
var currentSegmentDigits = 0
- val bytes = s.trimAll().getBytes
+ val bytes = s.getBytes
var j = 0
+ var strEndTrimmed = bytes.length
Review Comment:
I'm OK keeping it unchanged here too.
Yeah, `bytes.length` is cheap.
Re: [PR] [SPARK-46173][SQL] Skipping trimAll call during date parsing [spark]
Posted by "beliefer (via GitHub)" <gi...@apache.org>.
beliefer commented on code in PR #44110:
URL: https://github.com/apache/spark/pull/44110#discussion_r1413760894
##########
sql/api/src/main/scala/org/apache/spark/sql/catalyst/util/SparkDateTimeUtils.scala:
##########
@@ -305,21 +305,32 @@ trait SparkDateTimeUtils {
(segment == 0 && digits >= 4 && digits <= maxDigitsYear) ||
(segment != 0 && digits > 0 && digits <= 2)
}
- if (s == null || s.trimAll().numBytes() == 0) {
+ if (s == null) {
return None
}
+
val segments: Array[Int] = Array[Int](1, 1, 1)
var sign = 1
var i = 0
var currentSegmentValue = 0
var currentSegmentDigits = 0
- val bytes = s.trimAll().getBytes
Review Comment:
Got it.
Re: [PR] [SPARK-46173] Skipping trimAll call during date parsing [spark]
Posted by "MaxGekk (via GitHub)" <gi...@apache.org>.
MaxGekk commented on code in PR #44110:
URL: https://github.com/apache/spark/pull/44110#discussion_r1412447410
##########
sql/api/src/main/scala/org/apache/spark/sql/catalyst/util/SparkDateTimeUtils.scala:
##########
@@ -305,21 +305,28 @@ trait SparkDateTimeUtils {
(segment == 0 && digits >= 4 && digits <= maxDigitsYear) ||
(segment != 0 && digits > 0 && digits <= 2)
}
- if (s == null || s.trimAll().numBytes() == 0) {
Review Comment:
So, actually we could still have the optimization if you add an `if` check afterwards:
```scala
while (j < bytes.length && UTF8String.isWhitespaceOrISOControl(bytes(j))) {
j += 1;
}
if (j == bytes.length) {
return None;
}
```
##########
sql/api/src/main/scala/org/apache/spark/sql/catalyst/util/SparkDateTimeUtils.scala:
##########
@@ -305,21 +305,28 @@ trait SparkDateTimeUtils {
(segment == 0 && digits >= 4 && digits <= maxDigitsYear) ||
(segment != 0 && digits > 0 && digits <= 2)
}
- if (s == null || s.trimAll().numBytes() == 0) {
+ if (s == null) {
return None
}
+
val segments: Array[Int] = Array[Int](1, 1, 1)
var sign = 1
var i = 0
var currentSegmentValue = 0
var currentSegmentDigits = 0
- val bytes = s.trimAll().getBytes
+ val bytes = s.getBytes
var j = 0
- if (bytes(j) == '-' || bytes(j) == '+') {
+
+ while (j < bytes.length && UTF8String.isWhitespaceOrISOControl(bytes(j))) {
+ j += 1;
+ }
+
+ if (j < bytes.length && (bytes(j) == '-' || bytes(j) == '+')) {
sign = if (bytes(j) == '-') -1 else 1
j += 1
}
- while (j < bytes.length && (i < 3 && !(bytes(j) == ' ' || bytes(j) == 'T'))) {
+ while ( j < bytes.length &&
Review Comment:
Could you remove the gap and leave this part as it was:
```suggestion
while (j < bytes.length &&
```
Re: [PR] [SPARK-46173][SQL] Skipping trimAll call during date parsing [spark]
Posted by "dbatomic (via GitHub)" <gi...@apache.org>.
dbatomic commented on code in PR #44110:
URL: https://github.com/apache/spark/pull/44110#discussion_r1417024325
##########
sql/api/src/main/scala/org/apache/spark/sql/catalyst/util/SparkDateTimeUtils.scala:
##########
@@ -305,21 +305,35 @@ trait SparkDateTimeUtils {
(segment == 0 && digits >= 4 && digits <= maxDigitsYear) ||
(segment != 0 && digits > 0 && digits <= 2)
}
- if (s == null || s.trimAll().numBytes() == 0) {
+ if (s == null) {
return None
}
+
val segments: Array[Int] = Array[Int](1, 1, 1)
var sign = 1
var i = 0
var currentSegmentValue = 0
var currentSegmentDigits = 0
- val bytes = s.trimAll().getBytes
+ val bytes = s.getBytes
var j = 0
+ var strEndTrimmed = bytes.length
Review Comment:
Btw, we don't need strLen/bytes.length at all after we get the strEndTrimmed var. We don't care about anything past strEndTrimmed in this function.
Re: [PR] [SPARK-46173][SQL] Skipping trimAll call during date parsing [spark]
Posted by "beliefer (via GitHub)" <gi...@apache.org>.
beliefer commented on code in PR #44110:
URL: https://github.com/apache/spark/pull/44110#discussion_r1417096351
##########
sql/api/src/main/scala/org/apache/spark/sql/catalyst/util/SparkDateTimeUtils.scala:
##########
@@ -305,21 +305,35 @@ trait SparkDateTimeUtils {
(segment == 0 && digits >= 4 && digits <= maxDigitsYear) ||
(segment != 0 && digits > 0 && digits <= 2)
}
- if (s == null || s.trimAll().numBytes() == 0) {
+ if (s == null) {
return None
}
+
val segments: Array[Int] = Array[Int](1, 1, 1)
var sign = 1
var i = 0
var currentSegmentValue = 0
var currentSegmentDigits = 0
- val bytes = s.trimAll().getBytes
+ val bytes = s.getBytes
var j = 0
+ var strEndTrimmed = bytes.length
+
+ while (j < bytes.length && UTF8String.isWhitespaceOrISOControl(bytes(j))) {
Review Comment:
Please see
https://github.com/apache/spark/pull/44110#discussion_r1416537457
What I mean is to reuse `strLen`.
Re: [PR] [SPARK-46173][SQL] Skipping trimAll call during date parsing [spark]
Posted by "dbatomic (via GitHub)" <gi...@apache.org>.
dbatomic commented on code in PR #44110:
URL: https://github.com/apache/spark/pull/44110#discussion_r1415790402
##########
sql/api/src/main/scala/org/apache/spark/sql/catalyst/util/SparkDateTimeUtils.scala:
##########
@@ -305,21 +305,41 @@ trait SparkDateTimeUtils {
(segment == 0 && digits >= 4 && digits <= maxDigitsYear) ||
(segment != 0 && digits > 0 && digits <= 2)
}
- if (s == null || s.trimAll().numBytes() == 0) {
+ if (s == null) {
return None
}
+
val segments: Array[Int] = Array[Int](1, 1, 1)
var sign = 1
var i = 0
var currentSegmentValue = 0
var currentSegmentDigits = 0
- val bytes = s.trimAll().getBytes
+ val bytes = s.getBytes
var j = 0
+
+ while (j < bytes.length && UTF8String.isWhitespaceOrISOControl(bytes(j))) {
+ j += 1;
+ }
+
+ if (j == bytes.length) {
+ return None;
+ }
+
+ var strlen = bytes.length
+ while (strlen > j && UTF8String.isWhitespaceOrISOControl(bytes(strlen - 1))) {
+ strlen -= 1;
+ }
+
+ if (j >= strlen)
+ {
+ return None
Review Comment:
good catch, thanks! Removed.
Re: [PR] [SPARK-46173][SQL] Skipping trimAll call during date parsing [spark]
Posted by "beliefer (via GitHub)" <gi...@apache.org>.
beliefer commented on code in PR #44110:
URL: https://github.com/apache/spark/pull/44110#discussion_r1416536094
##########
sql/api/src/main/scala/org/apache/spark/sql/catalyst/util/SparkDateTimeUtils.scala:
##########
@@ -305,21 +305,35 @@ trait SparkDateTimeUtils {
(segment == 0 && digits >= 4 && digits <= maxDigitsYear) ||
(segment != 0 && digits > 0 && digits <= 2)
}
- if (s == null || s.trimAll().numBytes() == 0) {
+ if (s == null) {
return None
}
+
val segments: Array[Int] = Array[Int](1, 1, 1)
var sign = 1
var i = 0
var currentSegmentValue = 0
var currentSegmentDigits = 0
- val bytes = s.trimAll().getBytes
+ val bytes = s.getBytes
var j = 0
+ var strEndTrimmed = bytes.length
+
+ while (j < bytes.length && UTF8String.isWhitespaceOrISOControl(bytes(j))) {
+ j += 1;
+ }
+ if (j == bytes.length) {
Review Comment:
ditto.
##########
sql/api/src/main/scala/org/apache/spark/sql/catalyst/util/SparkDateTimeUtils.scala:
##########
@@ -305,21 +305,35 @@ trait SparkDateTimeUtils {
(segment == 0 && digits >= 4 && digits <= maxDigitsYear) ||
(segment != 0 && digits > 0 && digits <= 2)
}
- if (s == null || s.trimAll().numBytes() == 0) {
+ if (s == null) {
return None
}
+
val segments: Array[Int] = Array[Int](1, 1, 1)
var sign = 1
var i = 0
var currentSegmentValue = 0
var currentSegmentDigits = 0
- val bytes = s.trimAll().getBytes
+ val bytes = s.getBytes
var j = 0
+ var strEndTrimmed = bytes.length
+
+ while (j < bytes.length && UTF8String.isWhitespaceOrISOControl(bytes(j))) {
Review Comment:
Why not reuse `strEndTrimmed` here?
##########
sql/api/src/main/scala/org/apache/spark/sql/catalyst/util/SparkDateTimeUtils.scala:
##########
@@ -305,21 +305,35 @@ trait SparkDateTimeUtils {
(segment == 0 && digits >= 4 && digits <= maxDigitsYear) ||
(segment != 0 && digits > 0 && digits <= 2)
}
- if (s == null || s.trimAll().numBytes() == 0) {
+ if (s == null) {
return None
}
+
val segments: Array[Int] = Array[Int](1, 1, 1)
var sign = 1
var i = 0
var currentSegmentValue = 0
var currentSegmentDigits = 0
- val bytes = s.trimAll().getBytes
+ val bytes = s.getBytes
var j = 0
+ var strEndTrimmed = bytes.length
Review Comment:
I like the previous name; it was more universal, e.g. `strLen` or `len`.
##########
sql/api/src/main/scala/org/apache/spark/sql/catalyst/util/SparkDateTimeUtils.scala:
##########
@@ -305,21 +305,35 @@ trait SparkDateTimeUtils {
(segment == 0 && digits >= 4 && digits <= maxDigitsYear) ||
(segment != 0 && digits > 0 && digits <= 2)
}
- if (s == null || s.trimAll().numBytes() == 0) {
+ if (s == null) {
return None
}
+
val segments: Array[Int] = Array[Int](1, 1, 1)
var sign = 1
var i = 0
var currentSegmentValue = 0
var currentSegmentDigits = 0
- val bytes = s.trimAll().getBytes
+ val bytes = s.getBytes
var j = 0
+ var strEndTrimmed = bytes.length
Review Comment:
How about
```
val strLen = bytes.length
var strEndTrimmed = strLen
```
We keep `strLen` immutable and reuse it later. `strEndTrimmed` used as mutable variable.
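The naming suggestion above can be sketched in isolation (a standalone sketch, not the actual `SparkDateTimeUtils` code; `Character.isWhitespace` stands in for `UTF8String.isWhitespaceOrISOControl`):

```scala
// Sketch: strLen stays an immutable snapshot of the full byte length and can
// be reused later, while strEndTrimmed is the mutable trimmed-end offset.
def parseBounds(bytes: Array[Byte]): (Int, Int) = {
  val strLen = bytes.length      // immutable, reusable
  var strEndTrimmed = strLen     // mutable, excludes trailing whitespace
  while (strEndTrimmed > 0 && Character.isWhitespace(bytes(strEndTrimmed - 1).toChar)) {
    strEndTrimmed -= 1
  }
  (strLen, strEndTrimmed)
}
```

For `"ab  "` this yields `(4, 2)`: the full length is preserved while the trimmed end offset drops the two trailing spaces.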
Re: [PR] [SPARK-46173][SQL] Skipping trimAll call during date parsing [spark]
Posted by "MaxGekk (via GitHub)" <gi...@apache.org>.
MaxGekk commented on PR #44110:
URL: https://github.com/apache/spark/pull/44110#issuecomment-1838402901
> Looking for feedback if we want to keep the benchmark, given that this is a rather esoteric edge case.
I am ok to exclude the benchmark for the pretty specific case from the PR.
Re: [PR] [SPARK-46173] Skipping trimAll call during date parsing [spark]
Posted by "MaxGekk (via GitHub)" <gi...@apache.org>.
MaxGekk commented on PR #44110:
URL: https://github.com/apache/spark/pull/44110#issuecomment-1837229888
> The change also includes a small unit benchmark for this particular case.
I wonder about the other benchmarks. Do you observe any perf regressions? I am asking just in case.
Re: [PR] [SPARK-46173] Skipping trimAll call during date parsing [spark]
Posted by "MaxGekk (via GitHub)" <gi...@apache.org>.
MaxGekk commented on code in PR #44110:
URL: https://github.com/apache/spark/pull/44110#discussion_r1412486753
##########
sql/api/src/main/scala/org/apache/spark/sql/catalyst/util/SparkDateTimeUtils.scala:
##########
@@ -305,21 +305,28 @@ trait SparkDateTimeUtils {
(segment == 0 && digits >= 4 && digits <= maxDigitsYear) ||
(segment != 0 && digits > 0 && digits <= 2)
}
- if (s == null || s.trimAll().numBytes() == 0) {
+ if (s == null) {
return None
}
+
val segments: Array[Int] = Array[Int](1, 1, 1)
var sign = 1
var i = 0
var currentSegmentValue = 0
var currentSegmentDigits = 0
- val bytes = s.trimAll().getBytes
+ val bytes = s.getBytes
var j = 0
- if (bytes(j) == '-' || bytes(j) == '+') {
+
+ while (j < bytes.length && UTF8String.isWhitespaceOrISOControl(bytes(j))) {
+ j += 1;
+ }
+
+ if (j < bytes.length && (bytes(j) == '-' || bytes(j) == '+')) {
sign = if (bytes(j) == '-') -1 else 1
j += 1
}
- while (j < bytes.length && (i < 3 && !(bytes(j) == ' ' || bytes(j) == 'T'))) {
+ while ( j < bytes.length &&
+ (i < 3 && !(UTF8String.isWhitespaceOrISOControl(bytes(j)) || bytes(j) == 'T'))) {
Review Comment:
It would be nice to exclude the `UTF8String.isWhitespaceOrISOControl` check from the main loop. In regular cases, dates don't have any spaces, so we don't need to check every char for being a whitespace. Can we just inline `trimAll` at the beginning? Like:
```scala
var j = 0
while (j < bytes.length && UTF8String.isWhitespaceOrISOControl(bytes(j))) j += 1
var end = bytes.length - 1
while (j < end && UTF8String.isWhitespaceOrISOControl(bytes(end))) end -= 1
if (j >= end) return None
if (bytes(j) == '-' || bytes(j) == '+') {
sign = if (bytes(j) == '-') -1 else 1
j += 1
}
while (j < end && (i < 3 && !(bytes(j) == 'T'))) {
```
Re: [PR] [SPARK-46173][SQL] Skipping trimAll call during date parsing [spark]
Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on code in PR #44110:
URL: https://github.com/apache/spark/pull/44110#discussion_r1413746879
##########
sql/api/src/main/scala/org/apache/spark/sql/catalyst/util/SparkDateTimeUtils.scala:
##########
@@ -305,21 +305,32 @@ trait SparkDateTimeUtils {
(segment == 0 && digits >= 4 && digits <= maxDigitsYear) ||
(segment != 0 && digits > 0 && digits <= 2)
}
- if (s == null || s.trimAll().numBytes() == 0) {
+ if (s == null) {
return None
}
+
val segments: Array[Int] = Array[Int](1, 1, 1)
var sign = 1
var i = 0
var currentSegmentValue = 0
var currentSegmentDigits = 0
- val bytes = s.trimAll().getBytes
+ val bytes = s.getBytes
var j = 0
+
+ while (j < bytes.length && UTF8String.isWhitespaceOrISOControl(bytes(j))) {
+ j += 1;
+ }
+
+ if (j == bytes.length) {
+ return None;
+ }
+
if (bytes(j) == '-' || bytes(j) == '+') {
sign = if (bytes(j) == '-') -1 else 1
j += 1
}
- while (j < bytes.length && (i < 3 && !(bytes(j) == ' ' || bytes(j) == 'T'))) {
+ while (j < bytes.length &&
+ (i < 3 && !(UTF8String.isWhitespaceOrISOControl(bytes(j)) || bytes(j) == 'T'))) {
Review Comment:
Let's add a comment to explain the behavior: spaces in the middle may be allowed.
Re: [PR] [SPARK-46173][SQL] Skipping trimAll call during date parsing [spark]
Posted by "dbatomic (via GitHub)" <gi...@apache.org>.
dbatomic commented on code in PR #44110:
URL: https://github.com/apache/spark/pull/44110#discussion_r1415381396
##########
sql/api/src/main/scala/org/apache/spark/sql/catalyst/util/SparkDateTimeUtils.scala:
##########
@@ -305,21 +305,32 @@ trait SparkDateTimeUtils {
(segment == 0 && digits >= 4 && digits <= maxDigitsYear) ||
(segment != 0 && digits > 0 && digits <= 2)
}
- if (s == null || s.trimAll().numBytes() == 0) {
+ if (s == null) {
return None
}
+
val segments: Array[Int] = Array[Int](1, 1, 1)
var sign = 1
var i = 0
var currentSegmentValue = 0
var currentSegmentDigits = 0
- val bytes = s.trimAll().getBytes
+ val bytes = s.getBytes
var j = 0
+
+ while (j < bytes.length && UTF8String.isWhitespaceOrISOControl(bytes(j))) {
+ j += 1;
+ }
+
+ if (j == bytes.length) {
+ return None;
+ }
+
if (bytes(j) == '-' || bytes(j) == '+') {
sign = if (bytes(j) == '-') -1 else 1
j += 1
}
- while (j < bytes.length && (i < 3 && !(bytes(j) == ' ' || bytes(j) == 'T'))) {
+ while (j < bytes.length &&
+ (i < 3 && !(UTF8String.isWhitespaceOrISOControl(bytes(j)) || bytes(j) == 'T'))) {
Review Comment:
Implemented the approach with end offset. Please take a look.
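The "end offset" approach mentioned here handles trailing whitespace the same allocation-free way: scan backwards once and record an exclusive end index, then use it as the parse loop's bound. A hypothetical Java sketch of that idea (names like `trimmedEnd` are illustrative, not from the PR):

```java
public class EndOffsetSketch {
    // Assumed to mirror UTF8String.isWhitespaceOrISOControl for single bytes.
    static boolean isWhitespaceOrISOControl(byte b) {
        return Character.isWhitespace(b) || Character.isISOControl(b);
    }

    // Exclusive end index after dropping trailing whitespace/control bytes.
    static int trimmedEnd(byte[] bytes) {
        int end = bytes.length;
        while (end > 0 && isWhitespaceOrISOControl(bytes[end - 1])) {
            end--;
        }
        return end;
    }

    public static void main(String[] args) {
        byte[] input = "2023-12-01  \n".getBytes();
        // The parse loop would iterate with `j < trimmedEnd(input)`
        // instead of `j < input.length`, so trailing whitespace is
        // ignored without allocating a trimmed string.
        System.out.println(trimmedEnd(input));
    }
}
```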
Re: [PR] [SPARK-46173][SQL] Skipping trimAll call during date parsing [spark]
Posted by "dbatomic (via GitHub)" <gi...@apache.org>.
dbatomic commented on code in PR #44110:
URL: https://github.com/apache/spark/pull/44110#discussion_r1417456888
##########
sql/api/src/main/scala/org/apache/spark/sql/catalyst/util/SparkDateTimeUtils.scala:
##########
@@ -305,21 +305,35 @@ trait SparkDateTimeUtils {
(segment == 0 && digits >= 4 && digits <= maxDigitsYear) ||
(segment != 0 && digits > 0 && digits <= 2)
}
- if (s == null || s.trimAll().numBytes() == 0) {
+ if (s == null) {
return None
}
+
val segments: Array[Int] = Array[Int](1, 1, 1)
var sign = 1
var i = 0
var currentSegmentValue = 0
var currentSegmentDigits = 0
- val bytes = s.trimAll().getBytes
+ val bytes = s.getBytes
var j = 0
+ var strEndTrimmed = bytes.length
+
+ while (j < bytes.length && UTF8String.isWhitespaceOrISOControl(bytes(j))) {
Review Comment:
so, you are basically suggesting to rename strEndTrimmed back to strLen?
Re: [PR] [SPARK-46173][SQL] Skipping trimAll call during date parsing [spark]
Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on code in PR #44110:
URL: https://github.com/apache/spark/pull/44110#discussion_r1417489483
##########
sql/api/src/main/scala/org/apache/spark/sql/catalyst/util/SparkDateTimeUtils.scala:
##########
@@ -305,21 +305,35 @@ trait SparkDateTimeUtils {
(segment == 0 && digits >= 4 && digits <= maxDigitsYear) ||
(segment != 0 && digits > 0 && digits <= 2)
}
- if (s == null || s.trimAll().numBytes() == 0) {
+ if (s == null) {
return None
}
+
val segments: Array[Int] = Array[Int](1, 1, 1)
var sign = 1
var i = 0
var currentSegmentValue = 0
var currentSegmentDigits = 0
- val bytes = s.trimAll().getBytes
+ val bytes = s.getBytes
var j = 0
+ var strEndTrimmed = bytes.length
+
+ while (j < bytes.length && UTF8String.isWhitespaceOrISOControl(bytes(j))) {
Review Comment:
`strEndTrimmed` is more accurate.