Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2022/08/14 16:23:29 UTC

[GitHub] [spark] bersprockets opened a new pull request, #37513: [SPARK-39184][SQL] Handle undersized result array in date and timestamp sequences

bersprockets opened a new pull request, #37513:
URL: https://github.com/apache/spark/pull/37513

### What changes were proposed in this pull request?

Add code to defensively check if the pre-allocated result array is big enough to handle the next element in a date or timestamp sequence.

### Why are the changes needed?

`InternalSequenceBase.getSequenceLength` is a fast method for estimating the size of the result array. It uses an estimated step size in microseconds, which is not always accurate for a given date/time/time-zone combination. As a result, `getSequenceLength` occasionally overestimates the size of the result array and occasionally underestimates it.

`getSequenceLength` sometimes overestimates the size of the result array when the step size is in months (because `InternalSequenceBase` assumes 28 days per month). This case is handled: `InternalSequenceBase` will slice the array, if needed.
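
The overestimation case can be illustrated with a minimal Python sketch. This is not Spark's code: the estimate formula (`1 + elapsed_days // 28`) and the helper `add_months` are assumptions made for illustration, and only the 28-days-per-month figure and the final slice come from the description above.

```python
from datetime import date

def add_months(d, n):
    """Hypothetical helper: calendar-aware month step for dates on the 1st."""
    index = d.year * 12 + (d.month - 1) + n
    return date(index // 12, index % 12 + 1, 1)

start, stop = date(2022, 1, 1), date(2023, 1, 1)

# Assumed estimate: treat every month as 28 days.
estimated_len = 1 + (stop - start).days // 28

# Pre-allocate, then fill using real calendar months.
buffer = [None] * estimated_len
i = 0
current = start
while current <= stop:
    buffer[i] = current
    i += 1
    current = add_months(start, i)

# The estimate was too large, so trim the unused tail.
result = buffer[:i]
print(estimated_len, len(result))
```

Here the estimate allocates one slot more than the 13 monthly values actually produced, and the slice discards it.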

`getSequenceLength` sometimes underestimates the size of the result array when the sequence crosses a DST "spring forward" without a corresponding "fall backward". This case is not handled (thus, this PR).

For example:
```
select sequence(
timestamp'2022-03-13 00:00:00',
timestamp'2022-03-14 00:00:00',
interval 1 day) as x;
```
In the America/Los_Angeles time zone, this results in the following error:
```
java.lang.ArrayIndexOutOfBoundsException: 1
at scala.runtime.ScalaRunTime$.array_update(ScalaRunTime.scala:77)
```
This happens because `InternalSequenceBase` calculates an estimated step size of 24 hours. If you add 24 hours to 2022-03-13 00:00:00 in the America/Los_Angeles time zone, you get 2022-03-14 01:00:00 (because 2022-03-13 has only 23 hours due to "spring forward"). Since 2022-03-14 01:00:00 is later than the specified stop value, `getSequenceLength` assumes the stop value is not included in the result. Therefore, `getSequenceLength` estimates an array size of 1.

However, when actually creating the sequence, `InternalSequenceBase` does not use a step of 24 hours, but of 1 day. When you add 1 day to 2022-03-13 00:00:00, you get 2022-03-14 00:00:00. Now the stop value *is* included, and we overrun the end of the result array.
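
The mismatch between the two step sizes can be reproduced outside Spark. The following Python sketch uses the standard `zoneinfo` module. It is not Spark's implementation: the estimate formula (`1 + elapsed // step`) and the variable names are assumptions for illustration.

```python
from datetime import datetime, timedelta, timezone
from zoneinfo import ZoneInfo

LA = ZoneInfo("America/Los_Angeles")
start = datetime(2022, 3, 13, 0, 0, tzinfo=LA)
stop = datetime(2022, 3, 14, 0, 0, tzinfo=LA)

def to_utc(dt):
    return dt.astimezone(timezone.utc)

# Estimate: treat "1 day" as a fixed 24 hours of elapsed time.
estimated_step = timedelta(hours=24)
elapsed = to_utc(stop) - to_utc(start)   # 23 hours, because of "spring forward"
estimated_len = 1 + elapsed // estimated_step

# Adding 24 elapsed hours lands past the stop value.
plus_24_hours = (to_utc(start) + estimated_step).astimezone(LA)

# Adding a timedelta to an aware datetime in Python moves the wall clock,
# which behaves like a calendar step of 1 day.
plus_1_day = start + timedelta(days=1)

# Build the sequence with the calendar step.
actual = []
current = start
while current <= stop:
    actual.append(current)
    current = current + timedelta(days=1)

print(plus_24_hours.isoformat(), plus_1_day.isoformat())
print(estimated_len, len(actual))
```

The estimated length is 1, while stepping by calendar days yields 2 elements, which is the off-by-one that overruns the pre-allocated array.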

The new unit test includes examples of problematic date sequences.

This PR adds code to handle the underestimation case: it checks if we're about to overrun the array, and if so, gets a new array that's larger by 1.
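
As a rough sketch of that defensive check (in Python rather than the PR's Scala, with a hypothetical function name `fill_sequence` and integer values standing in for timestamps), the fill loop might look like this:

```python
def fill_sequence(start, stop, step, estimated_len):
    """Fill a pre-allocated buffer, growing it by 1 if the estimate was too small.

    Hypothetical illustration only; assumes step > 0.
    """
    # A fixed-size list stands in for the pre-allocated result array.
    buffer = [None] * estimated_len
    i = 0
    current = start
    while current <= stop:
        if i == len(buffer):
            # About to overrun: get a new buffer that is larger by 1.
            buffer = buffer + [None]
        buffer[i] = current
        i += 1
        current = start + i * step
    # Trim unused slots in case the estimate was too large.
    return buffer[:i]

print(fill_sequence(0, 10, 5, estimated_len=2))
```

With an underestimated length of 2, the loop grows the buffer once and still returns all three values. An overestimated length is handled by the final slice.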

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

New unit test.
