You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2022/08/14 16:23:29 UTC

[GitHub] [spark] bersprockets opened a new pull request, #37513: [SPARK-39184][SQL] Handle undersized result array in date and timestamp sequences

bersprockets opened a new pull request, #37513:
URL: https://github.com/apache/spark/pull/37513

   ### What changes were proposed in this pull request?
   
   Add code to defensively check if the pre-allocated result array is big enough to handle the next element in a date or timestamp sequence.
   
   ### Why are the changes needed?
   
   `InternalSequenceBase.getSequenceLength` is a fast method for estimating the size of the result array. It uses an estimated step size in micros which is not always entirely accurate for the date/time/time-zone combination. As a result, `getSequenceLength` occasionally overestimates the size of the result array and also occasionally underestimates the size of the result array.
   
   `getSequenceLength` sometimes overestimates the size of the result array when the step size is in months (because `InternalSequenceBase` assumes 28 days per month). This case is handled: `InternalSequenceBase` will slice the array, if needed.
   
   `getSequenceLength` sometimes understimates the size of the result array when the sequence crosses a DST "spring forward" without a corresponding "fall backward". This case is not handled (thus, this PR).
   
   For example:
   ```
   select sequence(
     timestamp'2022-03-13 00:00:00',
     timestamp'2022-03-14 00:00:00',
     interval 1 day) as x;
   ```
   In the America/Los_Angeles time zone, this results in the following error:
   ```
   java.lang.ArrayIndexOutOfBoundsException: 1
   	at scala.runtime.ScalaRunTime$.array_update(ScalaRunTime.scala:77)
   ```
   This happens because `InternalSequenceBase` calculates an estimated step size of 24 hours. If you add 24 hours to 2022-03-13 00:00:00 in the America/Los_Angeles time zone, you get 2022-03-14 01:00:00 (because 2022-03-13 has only 23 hours due to "spring forward"). Since 2022-03-14 01:00:00 is later than the specified stop value, `getSequenceLength` assumes the stop value is not included in the result. Therefore, `getSequenceLength` estimates an array size of 1.
   
   However, when actually creating the sequence, `InternalSequenceBase` does not use a step of 24 hours, but of 1 day. When you add 1 day to 2022-03-13 00:00:00, you get 2022-03-14 00:00:00. Now the stop value *is* included, and we overrun the end of the result array.
   
   The new unit test includes examples of problematic date sequences.
   
   This PR adds code to to handle the underestimation case: it checks if we're about to overrun the array, and if so, gets a new array that's larger by 1.
   
   ### Does this PR introduce _any_ user-facing change?
   
   No.
   
   ### How was this patch tested?
   
   New unit test.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] bersprockets commented on a diff in pull request #37513: [SPARK-39184][SQL] Handle undersized result array in date and timestamp sequences

Posted by GitBox <gi...@apache.org>.
bersprockets commented on code in PR #37513:
URL: https://github.com/apache/spark/pull/37513#discussion_r946227016


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala:
##########
@@ -3055,17 +3055,25 @@ object Sequence {
         val startMicros: Long = toMicros(num.toLong(start), scale)
         val stopMicros: Long = toMicros(num.toLong(stop), scale)
 
-        val maxEstimatedArrayLength =
+        val estimatedArrayLength =
           getSequenceLength(startMicros, stopMicros, input3, intervalStepInMicros)
 
         val stepSign = if (intervalStepInMicros > 0) +1 else -1
         val exclusiveItem = stopMicros + stepSign
-        val arr = new Array[T](maxEstimatedArrayLength)
+        var arr = new Array[T](estimatedArrayLength)
         var t = startMicros
         var i = 0
 
         while (t < exclusiveItem ^ stepSign < 0) {
           val result = fromMicros(t, scale)
+          // if we've underestimated the size of the array, due to crossing a DST
+          // "spring forward" without a corresponding "fall back", make a copy
+          // that's larger by 1
+          if (i == arr.length) {
+            val newArr = new Array[T](estimatedArrayLength + 1)

Review Comment:
   Actually, I can shorten this to `val arr = arr.padTo(...)`.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] MaxGekk closed pull request #37513: [SPARK-39184][SQL] Handle undersized result array in date and timestamp sequences

Posted by GitBox <gi...@apache.org>.
MaxGekk closed pull request #37513: [SPARK-39184][SQL] Handle undersized result array in date and timestamp sequences
URL: https://github.com/apache/spark/pull/37513


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] MaxGekk commented on pull request #37513: [SPARK-39184][SQL] Handle undersized result array in date and timestamp sequences

Posted by GitBox <gi...@apache.org>.
MaxGekk commented on PR #37513:
URL: https://github.com/apache/spark/pull/37513#issuecomment-1216339628

   @bersprockets The changes cause conflicts in branch-3.1. Please, open a PR w/ a backport to 3.1.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] MaxGekk commented on a diff in pull request #37513: [SPARK-39184][SQL] Handle undersized result array in date and timestamp sequences

Posted by GitBox <gi...@apache.org>.
MaxGekk commented on code in PR #37513:
URL: https://github.com/apache/spark/pull/37513#discussion_r946505567


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala:
##########
@@ -3173,6 +3179,9 @@ object Sequence {
          |  int $i = 0;
          |
          |  while ($t < $exclusiveItem ^ $stepSign < 0) {
+         |    if ($i == $arr.length) {
+         |      $arr = java.util.Arrays.copyOf($arr, $i + 1);

Review Comment:
   I got it. This LGTM.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] bersprockets commented on a diff in pull request #37513: [SPARK-39184][SQL] Handle undersized result array in date and timestamp sequences

Posted by GitBox <gi...@apache.org>.
bersprockets commented on code in PR #37513:
URL: https://github.com/apache/spark/pull/37513#discussion_r946227016


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala:
##########
@@ -3055,17 +3055,25 @@ object Sequence {
         val startMicros: Long = toMicros(num.toLong(start), scale)
         val stopMicros: Long = toMicros(num.toLong(stop), scale)
 
-        val maxEstimatedArrayLength =
+        val estimatedArrayLength =
           getSequenceLength(startMicros, stopMicros, input3, intervalStepInMicros)
 
         val stepSign = if (intervalStepInMicros > 0) +1 else -1
         val exclusiveItem = stopMicros + stepSign
-        val arr = new Array[T](maxEstimatedArrayLength)
+        var arr = new Array[T](estimatedArrayLength)
         var t = startMicros
         var i = 0
 
         while (t < exclusiveItem ^ stepSign < 0) {
           val result = fromMicros(t, scale)
+          // if we've underestimated the size of the array, due to crossing a DST
+          // "spring forward" without a corresponding "fall back", make a copy
+          // that's larger by 1
+          if (i == arr.length) {
+            val newArr = new Array[T](estimatedArrayLength + 1)

Review Comment:
   Actually, I can shorten this to `val arr = arr.padTo`.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] bersprockets commented on a diff in pull request #37513: [SPARK-39184][SQL] Handle undersized result array in date and timestamp sequences

Posted by GitBox <gi...@apache.org>.
bersprockets commented on code in PR #37513:
URL: https://github.com/apache/spark/pull/37513#discussion_r947371052


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala:
##########
@@ -3173,6 +3179,9 @@ object Sequence {
          |  int $i = 0;
          |
          |  while ($t < $exclusiveItem ^ $stepSign < 0) {
+         |    if ($i == $arr.length) {
+         |      $arr = java.util.Arrays.copyOf($arr, $i + 1);

Review Comment:
   Actually, you're right, they are different and I think the code genned code is more correct. I followed up with #37542



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] MaxGekk commented on a diff in pull request #37513: [SPARK-39184][SQL] Handle undersized result array in date and timestamp sequences

Posted by GitBox <gi...@apache.org>.
MaxGekk commented on code in PR #37513:
URL: https://github.com/apache/spark/pull/37513#discussion_r946487512


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala:
##########
@@ -3173,6 +3179,9 @@ object Sequence {
          |  int $i = 0;
          |
          |  while ($t < $exclusiveItem ^ $stepSign < 0) {
+         |    if ($i == $arr.length) {
+         |      $arr = java.util.Arrays.copyOf($arr, $i + 1);

Review Comment:
   Why the codegen and "interpret" code is different? If there is no specific reason, please, make them the same.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] bersprockets commented on a diff in pull request #37513: [SPARK-39184][SQL] Handle undersized result array in date and timestamp sequences

Posted by GitBox <gi...@apache.org>.
bersprockets commented on code in PR #37513:
URL: https://github.com/apache/spark/pull/37513#discussion_r945329087


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala:
##########
@@ -3055,17 +3055,25 @@ object Sequence {
         val startMicros: Long = toMicros(num.toLong(start), scale)
         val stopMicros: Long = toMicros(num.toLong(stop), scale)
 
-        val maxEstimatedArrayLength =
+        val estimatedArrayLength =
           getSequenceLength(startMicros, stopMicros, input3, intervalStepInMicros)

Review Comment:
   Another approach I tested was "fudging" the stopMicros by adding 1 hour. That also worked without adding too many extra cases of overestimation (which requires slicing the array on completion). However, that approach works only until some time zone decides to have two "spring forwards" with no intervening "fall backward", or a "spring forward" of more than one hour. Both of those situations are unlikely to occur, but always a possibility.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] MaxGekk commented on pull request #37513: [SPARK-39184][SQL] Handle undersized result array in date and timestamp sequences

Posted by GitBox <gi...@apache.org>.
MaxGekk commented on PR #37513:
URL: https://github.com/apache/spark/pull/37513#issuecomment-1216337179

   +1, LGTM. Merging to master, and trying to 3.3, 3.2, 3.1.
   Thank you, @bersprockets and @HyukjinKwon for review.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org