Posted to issues@spark.apache.org by "Pete Baker (JIRA)" <ji...@apache.org> on 2016/08/29 16:35:20 UTC

[jira] [Created] (SPARK-17297) window function generates unexpected results due to startTime being relative to UTC

Pete Baker created SPARK-17297:
----------------------------------

             Summary: window function generates unexpected results due to startTime being relative to UTC
                 Key: SPARK-17297
                 URL: https://issues.apache.org/jira/browse/SPARK-17297
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 2.0.0
            Reporter: Pete Baker


In Spark 2.0.0, the {{startTime}} parameter of the {{window(Column timeColumn, String windowDuration, String slideDuration, String startTime)}} function is documented as follows:

{quote}
startTime - The offset with respect to 1970-01-01 00:00:00 UTC with which to start window intervals. For example, in order to have hourly tumbling windows that start 15 minutes past the hour, e.g. 12:15-13:15, 13:15-14:15... provide startTime as 15 minutes.
{quote}
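
As a rough illustration of the documented semantics quoted above, the start of a tumbling window can be computed by laying windows out from the Unix epoch plus the offset. This is a minimal sketch in plain Python, not Spark's implementation; the helper name {{window_start}} and its parameters are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def window_start(ts, window, start_time=timedelta(0)):
    """Start of the tumbling window containing `ts` (a timezone-aware datetime).

    Hypothetical helper: windows are laid out from 1970-01-01 00:00:00 UTC
    plus `start_time`, one every `window`. It mirrors the documented
    semantics only and is not taken from Spark's source.
    """
    epoch = datetime(1970, 1, 1, tzinfo=timezone.utc)
    # Time elapsed since the first window boundary (epoch + start_time).
    elapsed = ts - epoch - start_time
    # Step back by however far `ts` is into its window.
    return ts - (elapsed % window)


event = datetime(2016, 8, 29, 12, 30, tzinfo=timezone.utc)

# Hourly windows that start 15 minutes past the hour.
hourly = window_start(event, timedelta(hours=1), timedelta(minutes=15))

# Daily windows with a 0h offset: boundaries fall on UTC midnight.
daily = window_start(event, timedelta(days=1))

print(hourly.isoformat())
print(daily.isoformat())
```

Note that with a {{1 day}} window and a {{0h}} offset, every boundary falls on UTC midnight, regardless of the timezone in which the data is viewed.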

Given a {{windowDuration}} of {{1 day}} and a {{startTime}} of {{0h}}, I'd expect events from each day to fall into that day's bucket. This does not always happen, however, because of the way this feature interacts with timestamp / timezone support.

Because the reference point is fixed in UTC, the function assumes that all days are the same length ({{1 day === 24 h}}). This is not the case for timezones that observe daylight saving time, where the offset from UTC changes by 1h for part of the year. In such timezones, on the days that clocks go forward/back, one day is 23h long and one day is 25h long.

The result is that, while daylight saving time is in effect, some rows within 1h of midnight are aggregated into the wrong day.
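
The effect can be reproduced without Spark. The sketch below buckets two events into fixed 24h windows anchored at UTC midnight, which is what a {{1 day}} window with a {{0h}} offset amounts to under the documented semantics. The two fixed offsets (UTC+0 in winter, UTC+1 in summer) are an assumption standing in for a real daylight-saving timezone, and {{utc_day_bucket}} is a hypothetical helper.

```python
from datetime import datetime, timedelta, timezone

DAY_SECONDS = 24 * 60 * 60


def utc_day_bucket(ts):
    """Start of the fixed 24h, UTC-anchored window containing `ts`."""
    epoch_seconds = int(ts.timestamp())
    bucket_start = epoch_seconds - epoch_seconds % DAY_SECONDS
    return datetime.fromtimestamp(bucket_start, tz=timezone.utc)


# Assumed local offsets for an illustrative zone that shifts by 1h in summer.
WINTER = timezone(timedelta(hours=0))
SUMMER = timezone(timedelta(hours=1))

# Two events at 00:30 local time, one in winter and one in summer.
winter_event = datetime(2016, 1, 15, 0, 30, tzinfo=WINTER)
summer_event = datetime(2016, 7, 15, 0, 30, tzinfo=SUMMER)

winter_bucket = utc_day_bucket(winter_event)
summer_bucket = utc_day_bucket(summer_event)

# The winter event lands in the bucket for its own local date; the summer
# event is 23:30 UTC on the previous day, so it lands one bucket early.
print(winter_event.date(), "->", winter_bucket.date())
print(summer_event.date(), "->", summer_bucket.date())
```

Under these assumptions, only events in the first local hour after midnight are misplaced during the summer offset, which matches the "within 1h of midnight" symptom described above.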

Either:

* This is the expected behavior, in which case the {{window}} function should not be used for this type of aggregation with a long window length, and its documentation should be updated to say so, or
* The {{window}} function should respect timezones and work on the assumption that {{1 day !== 24 h}}. The {{startTime}} should be updated to snap to the local timezone, rather than UTC.


