You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2022/05/13 21:45:45 UTC

[GitHub] [spark] bersprockets opened a new pull request, #36546: [SPARK-37544][SQL] Correct date arithmetic in sequences

bersprockets opened a new pull request, #36546:
URL: https://github.com/apache/spark/pull/36546

   ### What changes were proposed in this pull request?
   
   Change `InternalSequenceBase` to pass a time-zone aware value to `DateTimeUtils#timestampAddInterval`, rather than a time-zone agnostic value, when performing `Date` arithmetic.
   
   ### Why are the changes needed?
   
   The following query gets the wrong answer if run in the America/Los_Angeles time zone:
   ```
   spark-sql> select sequence(date '2021-01-01', date '2022-01-01', interval '3' month) x;
   [2021-01-01,2021-03-31,2021-06-30,2021-09-30,2022-01-01]
   Time taken: 0.664 seconds, Fetched 1 row(s)
   spark-sql> 
   ```
   The answer should be
   ```
   [2021-01-01,2021-04-01,2021-07-01,2021-10-01,2022-01-01]
   
   ```
   `InternalSequenceBase` converts the date to micros by multiplying days by micros per day. This converts the date into a time-zone agnostic timestamp. However, `InternalSequenceBase` uses `DateTimeUtils#timestampAddInterval` to perform the arithmetic, and that function assumes a _time-zone aware_ timestamp.
   
   One simple fix would be to call `DateTimeUtils#timestampNTZAddInterval` instead for date arithmetic. However, Spark date arithmetic is typically time-zone aware (see the comment in the test added by this PR), so this PR converts the date to a time-zone aware value before calling `DateTimeUtils#timestampAddInterval`.
   
   ### Does this PR introduce _any_ user-facing change?
   
   No.
   
   ### How was this patch tested?
   
   New unit test.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] bersprockets commented on pull request #36546: [SPARK-37544][SQL] Correct date arithmetic in sequences

Posted by GitBox <gi...@apache.org>.
bersprockets commented on PR #36546:
URL: https://github.com/apache/spark/pull/36546#issuecomment-1218598749

   A little late, but I opened PR to back-port to 3.1 (another back-port will need it as a prereq).
   
   #37559


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HyukjinKwon commented on pull request #36546: [SPARK-37544][SQL] Correct date arithmetic in sequences

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on PR #36546:
URL: https://github.com/apache/spark/pull/36546#issuecomment-1126832443

   @bersprockets, it has a conflict with branch-3.1. Please create a PR to backport if you think it should be backported :-).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HyukjinKwon commented on pull request #36546: [SPARK-37544][SQL] Correct date arithmetic in sequences

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on PR #36546:
URL: https://github.com/apache/spark/pull/36546#issuecomment-1126832378

   Merged to master, branch-3.3 and branch-3.2.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] MaxGekk commented on a diff in pull request #36546: [SPARK-37544][SQL] Correct date arithmetic in sequences

Posted by GitBox <gi...@apache.org>.
MaxGekk commented on code in PR #36546:
URL: https://github.com/apache/spark/pull/36546#discussion_r873062601


##########
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CollectionExpressionsSuite.scala:
##########
@@ -964,6 +964,50 @@ class CollectionExpressionsSuite extends SparkFunSuite with ExpressionEvalHelper
     }
   }
 
+  test("SPARK-37544: Time zone should not affect date sequence with month interval") {
+    outstandingZoneIds.foreach { zid =>
+      DateTimeTestUtils.withDefaultTimeZone(zid) {
+        checkEvaluation(new Sequence(
+          Literal(Date.valueOf("2021-01-01")),
+          Literal(Date.valueOf("2022-01-01")),
+          Literal(stringToInterval("interval 3 month"))),
+          Seq(
+            Date.valueOf("2021-01-01"),
+            Date.valueOf("2021-04-01"),
+            Date.valueOf("2021-07-01"),
+            Date.valueOf("2021-10-01"),
+            Date.valueOf("2022-01-01")))
+      }
+    }
+
+    // however, time zone should still affect sequences generated using hours interval, especially
+    // if the sequence's start-stop includes a "spring forward".
+    // take, for example, the following Spark date arithmetic:

Review Comment:
   A sentence should begin from an upper case letter?



##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala:
##########
@@ -3062,6 +3080,9 @@ object Sequence {
         "org.apache.spark.sql.catalyst.util.DateTimeUtils.timestampNTZAddInterval"
     }
 
+    private val daysToMicrosCode = "org.apache.spark.sql.catalyst.util.DateTimeUtils.daysToMicros"
+    private val microsToDaysCode = "org.apache.spark.sql.catalyst.util.DateTimeUtils.microsToDays"
+

Review Comment:
   Could you introduce a val for DateTimeUtils:
   ```scala
       private val dtu = DateTimeUtils.getClass.getName.stripSuffix("$")
   ```
   and re-use it here (and above):
   ```scala
       private val addIntervalCode = outerDataType match {
         case TimestampType | DateType => s"$dtu.timestampAddInterval"
         case TimestampNTZType => s"$dtu.timestampNTZAddInterval"
       }
   
       private val daysToMicrosCode = s"$dtu.daysToMicros"
       private val microsToDaysCode = s"$dtu.microsToDays"
   ```
   Look at datetimeExpressions.scala for examples.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] bersprockets commented on pull request #36546: [SPARK-37544][SQL] Correct date arithmetic in sequences

Posted by GitBox <gi...@apache.org>.
bersprockets commented on PR #36546:
URL: https://github.com/apache/spark/pull/36546#issuecomment-1126602057

   This PR brings `Date` in line with `Timestamp` (that is, time-zone aware).
   
   But even `Timestamp` sequences have some anomalies, e.g. (from a Spark without my change, in the America/Los_Angeles time-zone):
   ```
   spark-sql> select element_at(sequence(timestamp'2021-01-01', timestamp'2021-01-01' + interval 82 hours * 97, interval 82 hours), 97) as a;
   2021-11-24 23:00:00
   Time taken: 0.076 seconds, Fetched 1 row(s)
   spark-sql> select timestamp'2021-01-01' + interval 82 hours * 96 as x;
   2021-11-25 00:00:00
   Time taken: 0.053 seconds, Fetched 1 row(s)
   spark-sql> 
   ```
   The 96th (origin 0) element of the sequence from the first query is 1 hour less than the result of the second query. One would think they should be the same (both supposedly being `'2021-01-01' + interval 82 hours * 96 `), but the "fall back" is being handled differently around element 92 (origin 0) of the sequence.
   
   `Date` sequences also have (and will continue to have, after this PR) the same anomaly:
   ```
   spark-sql> select date'2021-01-01' + interval 82 hours * 96 as x;
   2021-11-25 00:00:00
   Time taken: 4.146 seconds, Fetched 1 row(s)
   spark-sql> select element_at(sequence(date'2021-01-01', date'2022-01-05', interval 82 hours), 97) as a;
   2021-11-24
   Time taken: 0.125 seconds, Fetched 1 row(s)
   spark-sql>  
   ```
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] bersprockets commented on pull request #36546: [SPARK-37544][SQL] Correct date arithmetic in sequences

Posted by GitBox <gi...@apache.org>.
bersprockets commented on PR #36546:
URL: https://github.com/apache/spark/pull/36546#issuecomment-1126602056

   This PR brings `Date` in line with `Timestamp` (that is, time-zone aware).
   
   But even `Timestamp` sequences have some anomalies, e.g. (from a Spark without my change, in the America/Los_Angeles time-zone):
   ```
   spark-sql> select element_at(sequence(timestamp'2021-01-01', timestamp'2021-01-01' + interval 82 hours * 97, interval 82 hours), 97) as a;
   2021-11-24 23:00:00
   Time taken: 0.076 seconds, Fetched 1 row(s)
   spark-sql> select timestamp'2021-01-01' + interval 82 hours * 96 as x;
   2021-11-25 00:00:00
   Time taken: 0.053 seconds, Fetched 1 row(s)
   spark-sql> 
   ```
   The 96th (origin 0) element of the sequence from the first query is 1 hour less than the result of the second query. One would think they should be the same (both supposedly being `'2021-01-01' + interval 82 hours * 96 `), but the "fall back" is being handled differently around element 92 (origin 0) of the sequence.
   
   `Date` sequences also have (and will continue to have, after this PR) the same anomaly:
   ```
   spark-sql> select date'2021-01-01' + interval 82 hours * 96 as x;
   2021-11-25 00:00:00
   Time taken: 4.146 seconds, Fetched 1 row(s)
   spark-sql> select element_at(sequence(date'2021-01-01', date'2022-01-05', interval 82 hours), 97) as a;
   2021-11-24
   Time taken: 0.125 seconds, Fetched 1 row(s)
   spark-sql>  
   ```
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HyukjinKwon closed pull request #36546: [SPARK-37544][SQL] Correct date arithmetic in sequences

Posted by GitBox <gi...@apache.org>.
HyukjinKwon closed pull request #36546: [SPARK-37544][SQL] Correct date arithmetic in sequences
URL: https://github.com/apache/spark/pull/36546


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org