Posted to issues@spark.apache.org by "Enrico Minack (Jira)" <ji...@apache.org> on 2020/03/05 15:49:00 UTC

[jira] [Created] (SPARK-31056) Add CalendarIntervals division

Enrico Minack created SPARK-31056:
-------------------------------------

             Summary: Add CalendarIntervals division
                 Key: SPARK-31056
                 URL: https://issues.apache.org/jira/browse/SPARK-31056
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 3.1.0
            Reporter: Enrico Minack


{{CalendarInterval}} should support division. A {{CalendarInterval}} consists of three time components: {{months}}, {{days}} and {{microseconds}}. Division can only be defined between two intervals that each have exactly one non-zero time component, and it must be the same component in both. Otherwise the division expression would be ambiguous.
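The proposed semantics could be sketched as follows (a plain-Scala illustration, not Spark code; the {{Interval}} case class and {{divide}} helper are hypothetical names used only for this sketch):
{code}
// Divide two intervals only when both have the same single non-zero component.
case class Interval(months: Int, days: Int, microseconds: Long)

def divide(left: Interval, right: Interval): Option[Double] = {
  val pairs = Seq(
    (left.months.toDouble, right.months.toDouble),
    (left.days.toDouble, right.days.toDouble),
    (left.microseconds.toDouble, right.microseconds.toDouble))
  // keep the components that are non-zero on either side
  pairs.filter { case (l, r) => l != 0.0 || r != 0.0 } match {
    // exactly one shared non-zero component: division is well-defined
    case Seq((l, r)) if l != 0.0 && r != 0.0 => Some(l / r)
    // zero or several non-zero components: ambiguous, no result
    case _ => None
  }
}
{code}
For example, dividing "1 hours 30 minutes 25 seconds" (5425000000 microseconds) by "1 hour" (3600000000 microseconds) yields roughly 1.5069, while dividing a month-based interval by a microsecond-based one is undefined.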

This allows evaluating the magnitude of a {{CalendarInterval}} in SQL expressions:
{code}
// assumes a SparkSession named `spark` is in scope
import java.sql.Timestamp
import org.apache.spark.sql.functions.lit
import org.apache.spark.sql.types.CalendarIntervalType
import spark.implicits._

Seq((Timestamp.valueOf("2020-02-01 12:00:00"), Timestamp.valueOf("2020-02-01 13:30:25")))
  .toDF("start", "end")
  .withColumn("interval", $"end" - $"start")
  .withColumn("interval [h]", $"interval" / lit("1 hour").cast(CalendarIntervalType))
  .withColumn("rate [€/h]", lit(1.45))
  .withColumn("price [€]", $"interval [h]" * $"rate [€/h]")
  .show(false)
+-------------------+-------------------+-----------------------------+------------------+----------+------------------+
|start              |end                |interval                     |interval [h]      |rate [€/h]|price [€]         |
+-------------------+-------------------+-----------------------------+------------------+----------+------------------+
|2020-02-01 12:00:00|2020-02-01 13:30:25|1 hours 30 minutes 25 seconds|1.5069444444444444|1.45      |2.1850694444444443|
+-------------------+-------------------+-----------------------------+------------------+----------+------------------+
{code}
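For reference, the figures in the table follow from plain arithmetic, independent of Spark:
{code}
val intervalSeconds = 1 * 3600 + 30 * 60 + 25   // 1 hours 30 minutes 25 seconds = 5425 s
val hours = intervalSeconds.toDouble / 3600     // interval [h]
val rate = 1.45                                 // rate [€/h]
val price = hours * rate                        // price [€]
// hours ≈ 1.5069444444444444, price ≈ 2.1850694444444443
{code}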

The currently available workaround is:

{code}
// assumes a SparkSession named `spark` is in scope
import java.sql.Timestamp
import org.apache.spark.sql.functions.{lit, unix_timestamp}
import spark.implicits._

Seq((Timestamp.valueOf("2020-02-01 12:00:00"), Timestamp.valueOf("2020-02-01 13:30:25")))
  .toDF("start", "end")
  .withColumn("interval [s]", unix_timestamp($"end") - unix_timestamp($"start"))
  .withColumn("interval [h]", $"interval [s]" / 3600)
  .withColumn("rate [€/h]", lit(1.45))
  .withColumn("price [€]", $"interval [h]" * $"rate [€/h]")
  .show(false)
{code}

Going through {{unix_timestamp}} is a hack and pollutes the SQL query with unrelated semantics (the Unix timestamp is completely irrelevant to this computation). It is only needed because there is currently no way to access the length of a {{CalendarInterval}}. Dividing one interval by another provides a means to measure that length in an arbitrary unit (minutes, hours, quarter hours).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
