Posted to dev@spark.apache.org by "Driesprong, Fokko" <fo...@driesprong.frl> on 2020/08/19 18:34:29 UTC

Allow average out of a Date

Hi all,

Personally, I'm a big fan of the .summary() function for computing statistics
of a dataframe. I often use it when debugging pipelines, to check what the
impact on the data is after changing code.

I've noticed that not all datatypes are included in this summary. Currently, there
is a list
<https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/stat/StatFunctions.scala#L267>
of types that are allowed in the summary, and I'd love to extend
that list.
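
To make this concrete, here is a minimal sketch (spark-shell style, assuming a
SparkSession named `spark`; the column names are made up): the string and
numeric columns show up in the summary, while the date column is currently
dropped.

import java.sql.Date
import spark.implicits._

val df = Seq(
  ("a", 1, Date.valueOf("2020-01-01")),
  ("b", 3, Date.valueOf("2020-01-03"))
).toDF("name", "amount", "day")

// Only "name" and "amount" appear in the output; "day" (DateType) is skipped.
df.summary().show()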

The first one is the date type. It is important to define this together
with the community and to reach consensus, as this will be part of the
public API. Changing it later will be costly (or maybe impossible).

I've checked what other DBMSes do with averages out of dates:

Postgres

Unsupported:

postgres@366ecc8a0fb9:/$ psql
psql (12.3 (Debian 12.3-1.pgdg100+1))
Type "help" for help.

postgres=# SELECT CAST(CAST('2020-01-01' AS DATE) AS decimal);
ERROR:  cannot cast type date to numeric
LINE 1: SELECT CAST(CAST('2020-01-01' AS DATE) AS decimal);
               ^

postgres=# SELECT CAST(CAST('2020-01-01' AS DATE) AS integer);
ERROR:  cannot cast type date to integer
LINE 1: SELECT CAST(CAST('2020-01-01' AS DATE) AS integer);
               ^

The way to get the epoch in days is:

postgres=# SELECT EXTRACT(DAYS FROM (now() - '1970-01-01'));
date_part
-----------
    18422
(1 row)


MySQL

Converts to a YYYYMMDD format:

mysql> SELECT CAST(CAST('2020-01-01' AS DATE) AS decimal);
+---------------------------------------------+
| CAST(CAST('2020-01-01' AS DATE) AS decimal) |
+---------------------------------------------+
|                                    20200101 |
+---------------------------------------------+
1 row in set (0.00 sec)

However, converting to an int isn't allowed:

mysql> SELECT CAST(CAST('2020-01-01' AS DATE) AS int);
ERROR 1064 (42000): You have an error in your SQL syntax; check the
manual that corresponds to your MySQL server version for the right
syntax to use near 'int)' at line 1

mysql> SELECT CAST(CAST('2020-01-01' AS DATE) AS bigint);
ERROR 1064 (42000): You have an error in your SQL syntax; check the
manual that corresponds to your MySQL server version for the right
syntax to use near 'bigint)' at line 1

Bigquery

Unsupported:

[image: image.png]

Excel

Converts it to the number of days since its epoch. This feels weird, but I can
see the logic, as days-since-epoch is used as the physical representation in
many data formats.

[image: image.png]

For me, returning a Date as the output of avg(date) seems like a logical
choice. Internally it is handled as days since epoch, which makes sense as
well:

*Avro* it is milliseconds since epoch:
https://github.com/apache/avro/blob/master/lang/java/avro/src/main/java/org/apache/avro/reflect/DateAsLongEncoding.java

*Parquet* it is days since epoch:
https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#date

*ORC* is based around days since Epoch:
https://github.com/apache/orc/blob/master/java/core/src/java/org/threeten/extra/chrono/HybridDate.java

Also with this, we keep parity with the Catalyst type :)
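
As a rough sketch of the behaviour I have in mind (spark-shell, SparkSession
`spark` assumed; this emulates the proposal by hand, it is not the actual
implementation): average over the days since epoch explicitly and turn the
result back into a date.

import java.sql.Date
import org.apache.spark.sql.functions._
import spark.implicits._

val dates = Seq(Date.valueOf("2020-01-01"), Date.valueOf("2020-01-03")).toDF("d")

// Average over days since the Unix epoch, then back to DateType.
dates
  .agg(avg(datediff(col("d"), lit(Date.valueOf("1970-01-01")))).as("avg_days"))
  .selectExpr("date_add('1970-01-01', CAST(avg_days AS INT)) AS avg_date")
  .show()  // 2020-01-02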

Any further thoughts on this before moving forward?

Kind regards, Fokko

Re: Allow average out of a Date

Posted by "Driesprong, Fokko" <fo...@driesprong.frl>.
> +1, we don't need to complicate the type system and claim that
date/timestamp can do avg/sum/etc. , we can just turn date/timestamp to
numeric, do whatever you want, and turn the result back to date/timestamp,
in the `summary` function.

+1 This is exactly my proposal in: [SPARK-10520][SQL] Allow average out of
DateType <https://github.com/apache/spark/pull/28754>

This will return a date out of avg(date), so the user isn't confused by
getting back an arbitrary number (in this case, days since epoch).

As a side note, I've illustrated the behaviour of avg(timestamp), which
currently returns the seconds since epoch, but should also return a timestamp
in my opinion: [SPARK-31981][SQL] Keep TimestampType when taking an average of a
Timestamp <https://github.com/apache/spark/pull/28821> However, changing
this means changing the public API :'( Feel free to chime in on the PR if
you're interested.

My proposal is to keep the input and output type the same.
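
For illustration, a small sketch of the timestamp situation today (spark-shell,
SparkSession `spark` assumed; behaviour as discussed elsewhere in this thread):
the average comes back as a double of seconds since epoch unless you explicitly
cast it back yourself.

import java.sql.Timestamp
import org.apache.spark.sql.functions._
import spark.implicits._

val ts = Seq(
  Timestamp.valueOf("2020-07-01 12:00:00"),
  Timestamp.valueOf("2020-07-02 18:00:00")
).toDF("a")

// Explicit cast to seconds since epoch, average, then explicitly back to a timestamp.
ts.agg(avg(col("a").cast("double")).cast("timestamp").as("avg_ts"))
  .show(false)  // 2020-07-02 03:00:00 (the midpoint)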

On Mon, 24 Aug 2020 at 05:04, Wenchen Fan <cl...@gmail.com> wrote:

> > can't we just teach that function how to do this using existing
> functionality w/o exposing all this internal numeric representation for
> dates?
>
> +1, we don't need to complicate the type system and claim that
> date/timestamp can do avg/sum/etc. , we can just turn date/timestamp to
> numeric, do whatever you want, and turn the result back to date/timestamp,
> in the `summary` function.
>
> On Mon, Aug 24, 2020 at 3:37 AM Greg Rahn <gr...@gmail.com> wrote:
>
>> I think your "working" timestamp example exposes the crux of the implicit
>> casting problem -- and PR#28554
>> <https://github.com/apache/spark/pull/28554> points to one of the other
>> prominent issues -- function return types.  The current
>> implementation seems to be problematic to me because there should not be
>> implicit conversions of timestamp types to double because there is no
>> single definition of how to represent a timestamp in double format.  It's
>> fine to have explicit casts or helper functions -- because then the user is
>> consciously choosing the "units" being used for a given numerical
>> representation.
>>
>> If the goal here is for the summary to work on dates, can't we just teach
>> that function how to do this using existing functionality w/o exposing all
>> this internal numeric representation for dates?
>>
>> Just to be clear -- I have no objection for the objective here -- I just
>> have concerns about the proposed implementation and the casting impacts.
>>
>> On Sat, Aug 22, 2020 at 12:28 PM Driesprong, Fokko <fo...@driesprong.frl>
>> wrote:
>>
>>> Thanks, Greg for your elaborate response, and sharing the table.
>>>
>>> The open PR is targetting SPARK-10520:
>>> https://github.com/apache/spark/pull/28554 Before opening a new ticket,
>>> I noticed that this was already an open issue. However, this will make it
>>> useable for Python/Java/Scala as well.
>>>
>>> Not being able to sum dates is still an open issue. For the current
>>> solution I've taken inspiration from the timestamp implementation in Spark:
>>>
>>> CREATE VIEW timestamps
>>>     AS SELECT explode(array(
>>>                   CAST('2020-07-01 12:00:00' AS timestamp),
>>>                   CAST('2020-07-02 18:00:00' AS timestamp)
>>>               )) AS a;
>>>
>>> SELECT SUM(a), SUM(a) / COUNT(a), AVG(a) FROM timestamps;
>>>
>>> Sum:         3.1873032E9
>>> Sum / Count: 1.5936516E9
>>> Avg:         1.5936516E9
>>>
>>> Using Spark 2.4.5. The datatype is implicitly converted a
>>> DoubleType (seconds since epoch), while I believe this should be an
>>> explicit choice. I've started the discussion
>>> <https://github.com/apache/spark/pull/28821> in a separate PR.
>>>
>>> I fully agree that summing/subtracting dates/timestamps don't look very
>>> sensible. However, taking an average can be of use. For example, if you're
>>> sessionising user activity or log entries, you might want to have a single
>>> date/timestamp to represent the session. Taking an average (or a median?)
>>> would make sense in such a case.
>>>
>>> > As you might know, AVG() is the equivalent of SUM()/COUNT() -- so if
>>> AVG() works on DATE, then SUM() (and the plus operator) would need to work
>>> as well -- but what is the definition of SUM() for a list of DATEs or
>>> DATE + DATE? There isn't any such definition in the SQL world. And take
>>> a very simple example -- what's the AVG() of today and yesterday? If you
>>> represent these dates as their Julian integer value, the average of two
>>> consecutive integers is a floating point number between them -- so now
>>> what?  Should it be ceil() or floor()? Round down or round up? Should it be
>>> yesterday or today?  It seems very challenging to deal with this in a
>>> formal and standardized way.
>>>
>>> I would not bother the end-user with all these questions, as I would
>>> expect in most cases this isn't an issue. If the user wants to have full
>>> control, it can first cast the date to a double, which gives days since
>>> epoch. Then it can have full control over possible rounding etc. Next step
>>> would be taking the std-dev out of a Date column, I would suggest using an
>>> interval type to represent the deviation, but let's take one step at the
>>> time.
>>>
>>> > It's my understanding that it's this ambiguity, and the need for an
>>> intermediate numerical representation like the Julian Day Number, that
>>> limits this from being an implicit SQL data type conversion operation and
>>> why users are forced to choose their own explicit conversion (to Julian
>>> date, for example) in order to do such mathematical operations. As we see
>>> in the SQL:2016 standard cast specification and conversion table (below) --
>>> there is no standardized way to implicitly cast between DATE and numeric
>>> types because there is not just one numerical representation of a DATE.
>>>
>>> For now, I've decided to stick with the days since Unix epoch, as it
>>> seems to be popular among file formats:
>>>
>>> *Avro* it is milliseconds since Unix epoch:
>>> https://github.com/apache/avro/blob/master/lang/java/avro/src/main/java/org/apache/avro/reflect/DateAsLongEncoding.java
>>> *Parquet* it is days since Unix epoch:
>>> https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#date
>>> *ORC* is days since Unix epoch:
>>> https://github.com/apache/orc/blob/master/java/core/src/java/org/threeten/extra/chrono/HybridDate.java
>>> *Spark *is days since Unix epoch:
>>> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/types/DateType.scala#L36
>>>
>>> > If possible, this would seem to yield the desired behavior for the use
>>> case in question, but not broaden the requirement so wide that it becomes
>>> too challenging to implement and conflicts with other well defined
>>> standards, like ANSI SQL:2016.
>>>
>>> I see your point, and I don't like to reinvent the standards. But the
>>> ANSI standard is clearly too limited, as you might also want to do things
>>> that you normally wouldn't do in a database (for example, when it comes to
>>> feature generation or the abovementioned sessionising). While looking at
>>> other RDBM'ses, it is clear that there isn't a uniform way of doing this,
>>> as shown in my initial mail. For example, you won't be able to sum the cast
>>> output (YYYYMMDD) format of MySQL, because you might get invalid
>>> months/days in the output (oktober+oktober=month 20?).
>>>
>>> Currently, it isn't possible to sum dates, if this is something that you
>>> want to have in there, I can add it to the PR. For the .summary() function
>>> it isn't required. The workaround is to cast it to a double first and then
>>> sum the days since epoch.
>>>
>>> Kind regards, Fokko
>>>
>>>
>>> Op vr 21 aug. 2020 om 18:13 schreef Greg Rahn <gr...@gmail.com>:
>>>
>>>> Fokko-
>>>>
>>>> I can certainly understand the ask for this behavior and it seems
>>>> simple and/or straightforward at face value -- as R's Date Class behaves
>>>> this way, however, I think there are several challenges involved here, as
>>>> AFAIK, there seems to be no well defined way SQL systems do this implicitly
>>>> today -- and I think there is a clear explanation of why.
>>>>
>>>> As you might know, AVG() is the equivalent of SUM()/COUNT() -- so if
>>>> AVG() works on DATE, then SUM() (and the plus operator) would need to work
>>>> as well -- but what is the definition of SUM() for a list of DATEs or
>>>> DATE + DATE? There isn't any such definition in the SQL world. And
>>>> take a very simple example -- what's the AVG() of today and yesterday? If
>>>> you represent these dates as their Julian integer value, the average of two
>>>> consecutive integers is a floating point number between them -- so now
>>>> what?  Should it be ceil() or floor()? Round down or round up? Should it be
>>>> yesterday or today?  It seems very challenging to deal with this in a
>>>> formal and standardized way.
>>>>
>>>> It's my understanding that it's this ambiguity, and the need for an
>>>> intermediate numerical representation like the Julian Day Number, that
>>>> limits this from being an implicit SQL data type conversion operation and
>>>> why users are forced to choose their own explicit conversion (to Julian
>>>> date, for example) in order to do such mathematical operations. As we see
>>>> in the SQL:2016 standard cast specification and conversion table (below) --
>>>> there is no standardized way to implicitly cast between DATE and numeric
>>>> types because there is not just one numerical representation of a DATE.
>>>>
>>>> If we look back at the use case from SPARK-10520 which is about R's
>>>> summary function on a vector of dates -- (and I don't claim to be an R
>>>> internals expert) but maybe it makes more sense to implement R's Date Class
>>>> behavior in SparkR vs more generically across multiple languages and
>>>> types.  If possible, this would seem to yield the desired behavior for the
>>>> use case in question, but not broaden the requirement so wide that it
>>>> becomes too challenging to implement and conflicts with other well defined
>>>> standards, like ANSI SQL:2016.
>>>>
>>>>
>>>> [image: Screen Shot 2020-08-21 at 12.00.01 PM.png]
>>>>
>>>>
>>>> For reference:
>>>> https://github.com/apache/spark/pull/28754
>>>> https://issues.apache.org/jira/browse/SPARK-10520
>>>>
>>>>
>>>>
>>>>
>>>> On Wed, Aug 19, 2020 at 9:34 PM Driesprong, Fokko <fo...@driesprong.frl>
>>>> wrote:
>>>>
>>>>> Hi all,
>>>>>
>>>>> Personally, I'm a big fan of the .summary() function to compute
>>>>> statistics of a dataframe. I often use this for debugging pipelines, and
>>>>> check what the impact of the RDD is after changing code.
>>>>>
>>>>> I've noticed that not all datatypes are in this summary. Currently,
>>>>> there is a list
>>>>> <https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/stat/StatFunctions.scala#L267>
>>>>> of types that allowed to be included in the summary, and I love to extend
>>>>> that list.
>>>>>
>>>>> The first one is the date type. It is important to define this
>>>>> together with the community, and that we get consensus, as this will be
>>>>> part of the public API. Changing this will be costly (or maybe impossible)
>>>>> to do.
>>>>>
>>>>> I've checked what other DBMS'es do with averages out of dates:
>>>>>
>>>>> Postgres
>>>>>
>>>>> Unsupported:
>>>>>
>>>>> postgres@366ecc8a0fb9:/$ psql
>>>>> psql (12.3 (Debian 12.3-1.pgdg100+1))
>>>>> Type "help" for help.
>>>>>
>>>>> postgres=# SELECT CAST(CAST('2020-01-01' AS DATE) AS decimal);
>>>>> ERROR:  cannot cast type date to numeric
>>>>> LINE 1: SELECT CAST(CAST('2020-01-01' AS DATE) AS decimal);
>>>>>                ^
>>>>>
>>>>> postgres=# SELECT CAST(CAST('2020-01-01' AS DATE) AS integer);
>>>>> ERROR:  cannot cast type date to integer
>>>>> LINE 1: SELECT CAST(CAST('2020-01-01' AS DATE) AS integer);
>>>>>                ^
>>>>>
>>>>> The way to get the epoch in days is:
>>>>>
>>>>> postgres=# SELECT EXTRACT(DAYS FROM (now() - '1970-01-01'));
>>>>> date_part
>>>>> -----------
>>>>>     18422
>>>>> (1 row)
>>>>>
>>>>>
>>>>> MySQL
>>>>>
>>>>> Converts to a YYYYMMDD format:
>>>>>
>>>>> mysql> SELECT CAST(CAST('2020-01-01' AS DATE) AS decimal);
>>>>> +---------------------------------------------+
>>>>> | CAST(CAST('2020-01-01' AS DATE) AS decimal) |
>>>>> +---------------------------------------------+
>>>>> |                                    20200101 |
>>>>> +---------------------------------------------+
>>>>> 1 row in set (0.00 sec)
>>>>>
>>>>> However, converting to an int, isn't allowed:
>>>>>
>>>>> mysql> SELECT CAST(CAST('2020-01-01' AS DATE) AS int);
>>>>> ERROR 1064 (42000): You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near 'int)' at line 1
>>>>>
>>>>> mysql> SELECT CAST(CAST('2020-01-01' AS DATE) AS bigint);
>>>>> ERROR 1064 (42000): You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near 'bigint)' at line 1
>>>>>
>>>>> Bigquery
>>>>>
>>>>> Unsupported:
>>>>>
>>>>> [image: image.png]
>>>>>
>>>>> Excel
>>>>>
>>>>> Converts it to the days since epoch. This feels weird, but I can see
>>>>> it, as it is being used as a physical format internally in many data
>>>>> formats.
>>>>>
>>>>> [image: image.png]
>>>>>
>>>>> For me, returning a Date as the output of avg(date) seems like a
>>>>> logical choice. Internally it is handled as dates since epoch, which makes
>>>>> sense as well:
>>>>>
>>>>> *Avro* it is milliseconds since epoch:
>>>>>
>>>>> https://github.com/apache/avro/blob/master/lang/java/avro/src/main/java/org/apache/avro/reflect/DateAsLongEncoding.java
>>>>>
>>>>> *Parquet* it is days since epoch:
>>>>>
>>>>> https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#date
>>>>>
>>>>> *ORC* is based around days since Epoch:
>>>>>
>>>>> https://github.com/apache/orc/blob/master/java/core/src/java/org/threeten/extra/chrono/HybridDate.java
>>>>>
>>>>> Also with this, we keep parity with the Catalyst type :)
>>>>>
>>>>> Any further thoughts on this before moving forward?
>>>>>
>>>>> Kind regards, Fokko
>>>>>
>>>>>
>>>>>

Re: Allow average out of a Date

Posted by Wenchen Fan <cl...@gmail.com>.
> can't we just teach that function how to do this using existing
functionality w/o exposing all this internal numeric representation for
dates?

+1, we don't need to complicate the type system and claim that
date/timestamp can do avg/sum/etc.; we can just turn date/timestamp into
numeric, do whatever we want, and turn the result back to date/timestamp,
in the `summary` function.
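
A minimal sketch of that idea (the helper name summaryWithDates is made up, and
converting the resulting stats back to dates is left out here): replace date
columns with day counts before calling summary(), so neither the type system
nor avg() itself has to change.

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.DateType

// Hypothetical helper: turn DateType columns into days since the Unix epoch,
// then run the existing summary() on plain integers.
def summaryWithDates(df: DataFrame): DataFrame = {
  val cols = df.schema.fields.map { f =>
    if (f.dataType == DateType) datediff(col(f.name), to_date(lit("1970-01-01"))).as(f.name)
    else col(f.name)
  }
  df.select(cols: _*).summary()
}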

On Mon, Aug 24, 2020 at 3:37 AM Greg Rahn <gr...@gmail.com> wrote:

> I think your "working" timestamp example exposes the crux of the implicit
> casting problem -- and PR#28554
> <https://github.com/apache/spark/pull/28554> points to one of the other
> prominent issues -- function return types.  The current
> implementation seems to be problematic to me because there should not be
> implicit conversions of timestamp types to double because there is no
> single definition of how to represent a timestamp in double format.  It's
> fine to have explicit casts or helper functions -- because then the user is
> consciously choosing the "units" being used for a given numerical
> representation.
>
> If the goal here is for the summary to work on dates, can't we just teach
> that function how to do this using existing functionality w/o exposing all
> this internal numeric representation for dates?
>
> Just to be clear -- I have no objection for the objective here -- I just
> have concerns about the proposed implementation and the casting impacts.
>
> On Sat, Aug 22, 2020 at 12:28 PM Driesprong, Fokko <fo...@driesprong.frl>
> wrote:
>
>> Thanks, Greg for your elaborate response, and sharing the table.
>>
>> The open PR is targetting SPARK-10520:
>> https://github.com/apache/spark/pull/28554 Before opening a new ticket,
>> I noticed that this was already an open issue. However, this will make it
>> useable for Python/Java/Scala as well.
>>
>> Not being able to sum dates is still an open issue. For the current
>> solution I've taken inspiration from the timestamp implementation in Spark:
>>
>> CREATE VIEW timestamps
>>     AS SELECT explode(array(
>>                   CAST('2020-07-01 12:00:00' AS timestamp),
>>                   CAST('2020-07-02 18:00:00' AS timestamp)
>>               )) AS a;
>>
>> SELECT SUM(a), SUM(a) / COUNT(a), AVG(a) FROM timestamps;
>>
>> Sum:         3.1873032E9
>> Sum / Count: 1.5936516E9
>> Avg:         1.5936516E9
>>
>> Using Spark 2.4.5. The datatype is implicitly converted a
>> DoubleType (seconds since epoch), while I believe this should be an
>> explicit choice. I've started the discussion
>> <https://github.com/apache/spark/pull/28821> in a separate PR.
>>
>> I fully agree that summing/subtracting dates/timestamps don't look very
>> sensible. However, taking an average can be of use. For example, if you're
>> sessionising user activity or log entries, you might want to have a single
>> date/timestamp to represent the session. Taking an average (or a median?)
>> would make sense in such a case.
>>
>> > As you might know, AVG() is the equivalent of SUM()/COUNT() -- so if
>> AVG() works on DATE, then SUM() (and the plus operator) would need to work
>> as well -- but what is the definition of SUM() for a list of DATEs or
>> DATE + DATE? There isn't any such definition in the SQL world. And take
>> a very simple example -- what's the AVG() of today and yesterday? If you
>> represent these dates as their Julian integer value, the average of two
>> consecutive integers is a floating point number between them -- so now
>> what?  Should it be ceil() or floor()? Round down or round up? Should it be
>> yesterday or today?  It seems very challenging to deal with this in a
>> formal and standardized way.
>>
>> I would not bother the end-user with all these questions, as I would
>> expect in most cases this isn't an issue. If the user wants to have full
>> control, it can first cast the date to a double, which gives days since
>> epoch. Then it can have full control over possible rounding etc. Next step
>> would be taking the std-dev out of a Date column, I would suggest using an
>> interval type to represent the deviation, but let's take one step at the
>> time.
>>
>> > It's my understanding that it's this ambiguity, and the need for an
>> intermediate numerical representation like the Julian Day Number, that
>> limits this from being an implicit SQL data type conversion operation and
>> why users are forced to choose their own explicit conversion (to Julian
>> date, for example) in order to do such mathematical operations. As we see
>> in the SQL:2016 standard cast specification and conversion table (below) --
>> there is no standardized way to implicitly cast between DATE and numeric
>> types because there is not just one numerical representation of a DATE.
>>
>> For now, I've decided to stick with the days since Unix epoch, as it
>> seems to be popular among file formats:
>>
>> *Avro* it is milliseconds since Unix epoch:
>> https://github.com/apache/avro/blob/master/lang/java/avro/src/main/java/org/apache/avro/reflect/DateAsLongEncoding.java
>> *Parquet* it is days since Unix epoch:
>> https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#date
>> *ORC* is days since Unix epoch:
>> https://github.com/apache/orc/blob/master/java/core/src/java/org/threeten/extra/chrono/HybridDate.java
>> *Spark *is days since Unix epoch:
>> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/types/DateType.scala#L36
>>
>> > If possible, this would seem to yield the desired behavior for the use
>> case in question, but not broaden the requirement so wide that it becomes
>> too challenging to implement and conflicts with other well defined
>> standards, like ANSI SQL:2016.
>>
>> I see your point, and I don't like to reinvent the standards. But the
>> ANSI standard is clearly too limited, as you might also want to do things
>> that you normally wouldn't do in a database (for example, when it comes to
>> feature generation or the abovementioned sessionising). While looking at
>> other RDBM'ses, it is clear that there isn't a uniform way of doing this,
>> as shown in my initial mail. For example, you won't be able to sum the cast
>> output (YYYYMMDD) format of MySQL, because you might get invalid
>> months/days in the output (oktober+oktober=month 20?).
>>
>> Currently, it isn't possible to sum dates, if this is something that you
>> want to have in there, I can add it to the PR. For the .summary() function
>> it isn't required. The workaround is to cast it to a double first and then
>> sum the days since epoch.
>>
>> Kind regards, Fokko
>>
>>
>> Op vr 21 aug. 2020 om 18:13 schreef Greg Rahn <gr...@gmail.com>:
>>
>>> Fokko-
>>>
>>> I can certainly understand the ask for this behavior and it seems simple
>>> and/or straightforward at face value -- as R's Date Class behaves this way,
>>> however, I think there are several challenges involved here, as AFAIK,
>>> there seems to be no well defined way SQL systems do this implicitly today
>>> -- and I think there is a clear explanation of why.
>>>
>>> As you might know, AVG() is the equivalent of SUM()/COUNT() -- so if
>>> AVG() works on DATE, then SUM() (and the plus operator) would need to work
>>> as well -- but what is the definition of SUM() for a list of DATEs or
>>> DATE + DATE? There isn't any such definition in the SQL world. And take
>>> a very simple example -- what's the AVG() of today and yesterday? If you
>>> represent these dates as their Julian integer value, the average of two
>>> consecutive integers is a floating point number between them -- so now
>>> what?  Should it be ceil() or floor()? Round down or round up? Should it be
>>> yesterday or today?  It seems very challenging to deal with this in a
>>> formal and standardized way.
>>>
>>> It's my understanding that it's this ambiguity, and the need for an
>>> intermediate numerical representation like the Julian Day Number, that
>>> limits this from being an implicit SQL data type conversion operation and
>>> why users are forced to choose their own explicit conversion (to Julian
>>> date, for example) in order to do such mathematical operations. As we see
>>> in the SQL:2016 standard cast specification and conversion table (below) --
>>> there is no standardized way to implicitly cast between DATE and numeric
>>> types because there is not just one numerical representation of a DATE.
>>>
>>> If we look back at the use case from SPARK-10520 which is about R's
>>> summary function on a vector of dates -- (and I don't claim to be an R
>>> internals expert) but maybe it makes more sense to implement R's Date Class
>>> behavior in SparkR vs more generically across multiple languages and
>>> types.  If possible, this would seem to yield the desired behavior for the
>>> use case in question, but not broaden the requirement so wide that it
>>> becomes too challenging to implement and conflicts with other well defined
>>> standards, like ANSI SQL:2016.
>>>
>>>
>>> [image: Screen Shot 2020-08-21 at 12.00.01 PM.png]
>>>
>>>
>>> For reference:
>>> https://github.com/apache/spark/pull/28754
>>> https://issues.apache.org/jira/browse/SPARK-10520
>>>
>>>
>>>
>>>
>>> On Wed, Aug 19, 2020 at 9:34 PM Driesprong, Fokko <fo...@driesprong.frl>
>>> wrote:
>>>
>>>> Hi all,
>>>>
>>>> Personally, I'm a big fan of the .summary() function to compute
>>>> statistics of a dataframe. I often use this for debugging pipelines, and
>>>> check what the impact of the RDD is after changing code.
>>>>
>>>> I've noticed that not all datatypes are in this summary. Currently,
>>>> there is a list
>>>> <https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/stat/StatFunctions.scala#L267>
>>>> of types that allowed to be included in the summary, and I love to extend
>>>> that list.
>>>>
>>>> The first one is the date type. It is important to define this together
>>>> with the community, and that we get consensus, as this will be part of the
>>>> public API. Changing this will be costly (or maybe impossible) to do.
>>>>
>>>> I've checked what other DBMS'es do with averages out of dates:
>>>>
>>>> Postgres
>>>>
>>>> Unsupported:
>>>>
>>>> postgres@366ecc8a0fb9:/$ psql
>>>> psql (12.3 (Debian 12.3-1.pgdg100+1))
>>>> Type "help" for help.
>>>>
>>>> postgres=# SELECT CAST(CAST('2020-01-01' AS DATE) AS decimal);
>>>> ERROR:  cannot cast type date to numeric
>>>> LINE 1: SELECT CAST(CAST('2020-01-01' AS DATE) AS decimal);
>>>>                ^
>>>>
>>>> postgres=# SELECT CAST(CAST('2020-01-01' AS DATE) AS integer);
>>>> ERROR:  cannot cast type date to integer
>>>> LINE 1: SELECT CAST(CAST('2020-01-01' AS DATE) AS integer);
>>>>                ^
>>>>
>>>> The way to get the epoch in days is:
>>>>
>>>> postgres=# SELECT EXTRACT(DAYS FROM (now() - '1970-01-01'));
>>>> date_part
>>>> -----------
>>>>     18422
>>>> (1 row)
>>>>
>>>>
>>>> MySQL
>>>>
>>>> Converts to a YYYYMMDD format:
>>>>
>>>> mysql> SELECT CAST(CAST('2020-01-01' AS DATE) AS decimal);
>>>> +---------------------------------------------+
>>>> | CAST(CAST('2020-01-01' AS DATE) AS decimal) |
>>>> +---------------------------------------------+
>>>> |                                    20200101 |
>>>> +---------------------------------------------+
>>>> 1 row in set (0.00 sec)
>>>>
>>>> However, converting to an int, isn't allowed:
>>>>
>>>> mysql> SELECT CAST(CAST('2020-01-01' AS DATE) AS int);
>>>> ERROR 1064 (42000): You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near 'int)' at line 1
>>>>
>>>> mysql> SELECT CAST(CAST('2020-01-01' AS DATE) AS bigint);
>>>> ERROR 1064 (42000): You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near 'bigint)' at line 1
>>>>
>>>> Bigquery
>>>>
>>>> Unsupported:
>>>>
>>>> [image: image.png]
>>>>
>>>> Excel
>>>>
>>>> Converts it to the days since epoch. This feels weird, but I can see
>>>> it, as it is being used as a physical format internally in many data
>>>> formats.
>>>>
>>>> [image: image.png]
>>>>
>>>> For me, returning a Date as the output of avg(date) seems like a
>>>> logical choice. Internally it is handled as dates since epoch, which makes
>>>> sense as well:
>>>>
>>>> *Avro* it is milliseconds since epoch:
>>>>
>>>> https://github.com/apache/avro/blob/master/lang/java/avro/src/main/java/org/apache/avro/reflect/DateAsLongEncoding.java
>>>>
>>>> *Parquet* it is days since epoch:
>>>>
>>>> https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#date
>>>>
>>>> *ORC* is based around days since Epoch:
>>>>
>>>> https://github.com/apache/orc/blob/master/java/core/src/java/org/threeten/extra/chrono/HybridDate.java
>>>>
>>>> Also with this, we keep parity with the Catalyst type :)
>>>>
>>>> Any further thoughts on this before moving forward?
>>>>
>>>> Kind regards, Fokko
>>>>
>>>>
>>>>

Re: Allow average out of a Date

Posted by Greg Rahn <gr...@gmail.com>.
I think your "working" timestamp example exposes the crux of the implicit
casting problem -- and PR#28554
<https://github.com/apache/spark/pull/28554> points to one of the other
prominent issues -- function return types.  The current implementation seems
problematic to me because there should not be implicit conversions of
timestamp types to double: there is no single definition of how to represent
a timestamp in double format.  It's fine to have explicit casts or helper
functions -- because then the user is consciously choosing the "units" being
used for a given numerical representation.

If the goal here is for the summary to work on dates, can't we just teach
that function how to do this using existing functionality w/o exposing all
this internal numeric representation for dates?

Just to be clear -- I have no objection to the objective here -- I just
have concerns about the proposed implementation and the casting impacts.
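
To make the point about units concrete, a small sketch (spark-shell,
SparkSession `spark` assumed; none of this is part of the PR): the caller, not
an implicit cast, chooses whether a timestamp becomes seconds or milliseconds.

import java.sql.Timestamp
import org.apache.spark.sql.functions._
import spark.implicits._

val ts = Seq(Timestamp.valueOf("2020-07-01 12:00:00")).toDF("a")

ts.select(
  unix_timestamp(col("a")).as("seconds_since_epoch"),                     // explicit choice: seconds
  (col("a").cast("double") * 1000).cast("long").as("millis_since_epoch")  // explicit choice: milliseconds
).show(false)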

On Sat, Aug 22, 2020 at 12:28 PM Driesprong, Fokko <fo...@driesprong.frl>
wrote:

> Thanks, Greg for your elaborate response, and sharing the table.
>
> The open PR is targetting SPARK-10520:
> https://github.com/apache/spark/pull/28554 Before opening a new ticket, I
> noticed that this was already an open issue. However, this will make it
> useable for Python/Java/Scala as well.
>
> Not being able to sum dates is still an open issue. For the current
> solution I've taken inspiration from the timestamp implementation in Spark:
>
> CREATE VIEW timestamps
>     AS SELECT explode(array(
>                   CAST('2020-07-01 12:00:00' AS timestamp),
>                   CAST('2020-07-02 18:00:00' AS timestamp)
>               )) AS a;
>
> SELECT SUM(a), SUM(a) / COUNT(a), AVG(a) FROM timestamps;
>
> Sum:         3.1873032E9
> Sum / Count: 1.5936516E9
> Avg:         1.5936516E9
>
> Using Spark 2.4.5. The datatype is implicitly converted a
> DoubleType (seconds since epoch), while I believe this should be an
> explicit choice. I've started the discussion
> <https://github.com/apache/spark/pull/28821> in a separate PR.
>
> I fully agree that summing/subtracting dates/timestamps don't look very
> sensible. However, taking an average can be of use. For example, if you're
> sessionising user activity or log entries, you might want to have a single
> date/timestamp to represent the session. Taking an average (or a median?)
> would make sense in such a case.
>
> > As you might know, AVG() is the equivalent of SUM()/COUNT() -- so if
> AVG() works on DATE, then SUM() (and the plus operator) would need to work
> as well -- but what is the definition of SUM() for a list of DATEs or
> DATE + DATE? There isn't any such definition in the SQL world. And take a
> very simple example -- what's the AVG() of today and yesterday? If you
> represent these dates as their Julian integer value, the average of two
> consecutive integers is a floating point number between them -- so now
> what?  Should it be ceil() or floor()? Round down or round up? Should it be
> yesterday or today?  It seems very challenging to deal with this in a
> formal and standardized way.
>
> I would not bother the end-user with all these questions, as I would
> expect in most cases this isn't an issue. If the user wants to have full
> control, it can first cast the date to a double, which gives days since
> epoch. Then it can have full control over possible rounding etc. Next step
> would be taking the std-dev out of a Date column, I would suggest using an
> interval type to represent the deviation, but let's take one step at the
> time.
>
> > It's my understanding that it's this ambiguity, and the need for an
> intermediate numerical representation like the Julian Day Number, that
> limits this from being an implicit SQL data type conversion operation and
> why users are forced to choose their own explicit conversion (to Julian
> date, for example) in order to do such mathematical operations. As we see
> in the SQL:2016 standard cast specification and conversion table (below) --
> there is no standardized way to implicitly cast between DATE and numeric
> types because there is not just one numerical representation of a DATE.
>
> For now, I've decided to stick with the days since Unix epoch, as it seems
> to be popular among file formats:
>
> *Avro* it is milliseconds since Unix epoch:
> https://github.com/apache/avro/blob/master/lang/java/avro/src/main/java/org/apache/avro/reflect/DateAsLongEncoding.java
> *Parquet* it is days since Unix epoch:
> https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#date
> *ORC* is days since Unix epoch:
> https://github.com/apache/orc/blob/master/java/core/src/java/org/threeten/extra/chrono/HybridDate.java
> *Spark *is days since Unix epoch:
> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/types/DateType.scala#L36
>
> > If possible, this would seem to yield the desired behavior for the use
> case in question, but not broaden the requirement so wide that it becomes
> too challenging to implement and conflicts with other well defined
> standards, like ANSI SQL:2016.
>
> I see your point, and I don't like to reinvent the standards. But the ANSI
> standard is clearly too limited, as you might also want to do things that
> you normally wouldn't do in a database (for example, when it comes to
> feature generation or the abovementioned sessionising). While looking at
> other RDBM'ses, it is clear that there isn't a uniform way of doing this,
> as shown in my initial mail. For example, you won't be able to sum the cast
> output (YYYYMMDD) format of MySQL, because you might get invalid
> months/days in the output (oktober+oktober=month 20?).
>
> Currently, it isn't possible to sum dates, if this is something that you
> want to have in there, I can add it to the PR. For the .summary() function
> it isn't required. The workaround is to cast it to a double first and then
> sum the days since epoch.
>
> Kind regards, Fokko
>
>
> Op vr 21 aug. 2020 om 18:13 schreef Greg Rahn <gr...@gmail.com>:
>
>> Fokko-
>>
>> I can certainly understand the ask for this behavior and it seems simple
>> and/or straightforward at face value -- as R's Date Class behaves this way,
>> however, I think there are several challenges involved here, as AFAIK,
>> there seems to be no well defined way SQL systems do this implicitly today
>> -- and I think there is a clear explanation of why.
>>
>> As you might know, AVG() is the equivalent of SUM()/COUNT() -- so if
>> AVG() works on DATE, then SUM() (and the plus operator) would need to work
>> as well -- but what is the definition of SUM() for a list of DATEs or
>> DATE + DATE? There isn't any such definition in the SQL world. And take
>> a very simple example -- what's the AVG() of today and yesterday? If you
>> represent these dates as their Julian integer value, the average of two
>> consecutive integers is a floating point number between them -- so now
>> what?  Should it be ceil() or floor()? Round down or round up? Should it be
>> yesterday or today?  It seems very challenging to deal with this in a
>> formal and standardized way.
>>
>> It's my understanding that it's this ambiguity, and the need for an
>> intermediate numerical representation like the Julian Day Number, that
>> limits this from being an implicit SQL data type conversion operation and
>> why users are forced to choose their own explicit conversion (to Julian
>> date, for example) in order to do such mathematical operations. As we see
>> in the SQL:2016 standard cast specification and conversion table (below) --
>> there is no standardized way to implicitly cast between DATE and numeric
>> types because there is not just one numerical representation of a DATE.
>>
>> If we look back at the use case from SPARK-10520 which is about R's
>> summary function on a vector of dates -- (and I don't claim to be an R
>> internals expert) but maybe it makes more sense to implement R's Date Class
>> behavior in SparkR vs more generically across multiple languages and
>> types.  If possible, this would seem to yield the desired behavior for the
>> use case in question, but not broaden the requirement so wide that it
>> becomes too challenging to implement and conflicts with other well defined
>> standards, like ANSI SQL:2016.
>>
>>
>> [image: Screen Shot 2020-08-21 at 12.00.01 PM.png]
>>
>>
>> For reference:
>> https://github.com/apache/spark/pull/28754
>> https://issues.apache.org/jira/browse/SPARK-10520
>>
>>
>>
>>
>> On Wed, Aug 19, 2020 at 9:34 PM Driesprong, Fokko <fo...@driesprong.frl>
>> wrote:
>>
>>> Hi all,
>>>
>>> Personally, I'm a big fan of the .summary() function to compute
>>> statistics of a dataframe. I often use this for debugging pipelines, and
>>> check what the impact of the RDD is after changing code.
>>>
>>> I've noticed that not all datatypes are in this summary. Currently,
>>> there is a list
>>> <https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/stat/StatFunctions.scala#L267>
>>> of types that allowed to be included in the summary, and I love to extend
>>> that list.
>>>
>>> The first one is the date type. It is important to define this together
>>> with the community, and that we get consensus, as this will be part of the
>>> public API. Changing this will be costly (or maybe impossible) to do.
>>>
>>> I've checked what other DBMS'es do with averages out of dates:
>>>
>>> Postgres
>>>
>>> Unsupported:
>>>
>>> postgres@366ecc8a0fb9:/$ psql
>>> psql (12.3 (Debian 12.3-1.pgdg100+1))
>>> Type "help" for help.
>>>
>>> postgres=# SELECT CAST(CAST('2020-01-01' AS DATE) AS decimal);
>>> ERROR:  cannot cast type date to numeric
>>> LINE 1: SELECT CAST(CAST('2020-01-01' AS DATE) AS decimal);
>>>                ^
>>>
>>> postgres=# SELECT CAST(CAST('2020-01-01' AS DATE) AS integer);
>>> ERROR:  cannot cast type date to integer
>>> LINE 1: SELECT CAST(CAST('2020-01-01' AS DATE) AS integer);
>>>                ^
>>>
>>> The way to get the epoch in days is:
>>>
>>> postgres=# SELECT EXTRACT(DAYS FROM (now() - '1970-01-01'));
>>> date_part
>>> -----------
>>>     18422
>>> (1 row)
>>>
>>>
>>> MySQL
>>>
>>> Converts to a YYYYMMDD format:
>>>
>>> mysql> SELECT CAST(CAST('2020-01-01' AS DATE) AS decimal);
>>> +---------------------------------------------+
>>> | CAST(CAST('2020-01-01' AS DATE) AS decimal) |
>>> +---------------------------------------------+
>>> |                                    20200101 |
>>> +---------------------------------------------+
>>> 1 row in set (0.00 sec)
>>>
>>> However, converting to an int, isn't allowed:
>>>
>>> mysql> SELECT CAST(CAST('2020-01-01' AS DATE) AS int);
>>> ERROR 1064 (42000): You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near 'int)' at line 1
>>>
>>> mysql> SELECT CAST(CAST('2020-01-01' AS DATE) AS bigint);
>>> ERROR 1064 (42000): You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near 'bigint)' at line 1
>>>
>>> Bigquery
>>>
>>> Unsupported:
>>>
>>> [image: image.png]
>>>
>>> Excel
>>>
>>> Converts it to the days since epoch. This feels weird, but I can see it,
>>> as it is being used as a physical format internally in many data formats.
>>>
>>> [image: image.png]
>>>
>>> For me, returning a Date as the output of avg(date) seems like a logical
>>> choice. Internally it is handled as dates since epoch, which makes sense as
>>> well:
>>>
>>> *Avro* it is milliseconds since epoch:
>>>
>>> https://github.com/apache/avro/blob/master/lang/java/avro/src/main/java/org/apache/avro/reflect/DateAsLongEncoding.java
>>>
>>> *Parquet* it is days since epoch:
>>> https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#date
>>>
>>> *ORC* is based around days since Epoch:
>>>
>>> https://github.com/apache/orc/blob/master/java/core/src/java/org/threeten/extra/chrono/HybridDate.java
>>>
>>> Also with this, we keep parity with the Catalyst type :)
>>>
>>> Any further thoughts on this before moving forward?
>>>
>>> Kind regards, Fokko
>>>
>>>
>>>

Re: Allow average out of a Date

Posted by "Driesprong, Fokko" <fo...@driesprong.frl>.
Thanks, Greg, for your elaborate response and for sharing the table.

The open PR is targeting SPARK-10520:
https://github.com/apache/spark/pull/28554 Before opening a new ticket, I
noticed that this was already an open issue. However, this will make it
usable from Python/Java/Scala as well.

Not being able to sum dates is still an open issue. For the current
solution I've taken inspiration from the timestamp implementation in Spark:

CREATE VIEW timestamps
    AS SELECT explode(array(
                  CAST('2020-07-01 12:00:00' AS timestamp),
                  CAST('2020-07-02 18:00:00' AS timestamp)
              )) AS a;

SELECT SUM(a), SUM(a) / COUNT(a), AVG(a) FROM timestamps;

Sum:         3.1873032E9
Sum / Count: 1.5936516E9
Avg:         1.5936516E9

This is using Spark 2.4.5. The datatype is implicitly converted to a
DoubleType (seconds since epoch), while I believe this should be an
explicit choice. I've started the discussion
<https://github.com/apache/spark/pull/28821> in a separate PR.

I fully agree that summing/subtracting dates/timestamps doesn't look very
sensible. However, taking an average can be of use. For example, if you're
sessionising user activity or log entries, you might want a single
date/timestamp to represent the session. Taking an average (or a median?)
would make sense in such a case.
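
A small sketch of that use case (spark-shell, SparkSession `spark` assumed; the
column names are made up): one representative timestamp per session, computed
today through an explicit cast to seconds since epoch and back.

import java.sql.Timestamp
import org.apache.spark.sql.functions._
import spark.implicits._

val events = Seq(
  ("s1", Timestamp.valueOf("2020-08-22 10:00:00")),
  ("s1", Timestamp.valueOf("2020-08-22 10:30:00")),
  ("s2", Timestamp.valueOf("2020-08-22 11:00:00"))
).toDF("session_id", "event_time")

// One timestamp per session: average the seconds since epoch, cast back explicitly.
events.groupBy("session_id")
  .agg(avg(col("event_time").cast("double")).cast("timestamp").as("session_time"))
  .show(false)  // s1 -> 2020-08-22 10:15:00, s2 -> 2020-08-22 11:00:00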

> As you might know, AVG() is the equivalent of SUM()/COUNT() -- so if
AVG() works on DATE, then SUM() (and the plus operator) would need to work
as well -- but what is the definition of SUM() for a list of DATEs or
DATE + DATE? There isn't any such definition in the SQL world. And take a
very simple example -- what's the AVG() of today and yesterday? If you
represent these dates as their Julian integer value, the average of two
consecutive integers is a floating point number between them -- so now
what?  Should it be ceil() or floor()? Round down or round up? Should it be
yesterday or today?  It seems very challenging to deal with this in a
formal and standardized way.

I would not bother the end-user with all these questions, as I would expect
that in most cases this isn't an issue. If users want full control, they can
first cast the date to a double, which gives the days since epoch, and then
handle any rounding themselves. A next step would be taking the std-dev of a
Date column; I would suggest using an interval type to represent the
deviation, but let's take one step at a time.
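
As a sketch of what that explicit control looks like (spark-shell, SparkSession
`spark` assumed; datediff is used here to get the days since epoch explicitly):
the user decides between floor() and ceil() before going back to a date.

import java.sql.Date
import org.apache.spark.sql.functions._
import spark.implicits._

val d = Seq(Date.valueOf("2020-08-20"), Date.valueOf("2020-08-21")).toDF("d")

// Days since epoch, averaged; the rounding direction is an explicit user choice.
d.agg(avg(datediff(col("d"), lit(Date.valueOf("1970-01-01")))).as("avg_days"))
  .selectExpr(
    "date_add('1970-01-01', CAST(floor(avg_days) AS INT)) AS rounded_down",  // 2020-08-20
    "date_add('1970-01-01', CAST(ceil(avg_days) AS INT)) AS rounded_up"      // 2020-08-21
  )
  .show()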

> It's my understanding that it's this ambiguity, and the need for an
intermediate numerical representation like the Julian Day Number, that
limits this from being an implicit SQL data type conversion operation and
why users are forced to choose their own explicit conversion (to Julian
date, for example) in order to do such mathematical operations. As we see
in the SQL:2016 standard cast specification and conversion table (below) --
there is no standardized way to implicitly cast between DATE and numeric
types because there is not just one numerical representation of a DATE.

For now, I've decided to stick with the days since Unix epoch, as it seems
to be popular among file formats:

*Avro* it is milliseconds since Unix epoch:
https://github.com/apache/avro/blob/master/lang/java/avro/src/main/java/org/apache/avro/reflect/DateAsLongEncoding.java
*Parquet* it is days since Unix epoch:
https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#date
*ORC* is days since Unix epoch:
https://github.com/apache/orc/blob/master/java/core/src/java/org/threeten/extra/chrono/HybridDate.java
*Spark* is days since Unix epoch:
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/types/DateType.scala#L36

> If possible, this would seem to yield the desired behavior for the use
case in question, but not broaden the requirement so wide that it becomes
too challenging to implement and conflicts with other well defined
standards, like ANSI SQL:2016.

I see your point, and I don't like to reinvent the standards. But the ANSI
standard is clearly too limited here, as you might also want to do things that
you normally wouldn't do in a database (for example, when it comes to
feature generation or the abovementioned sessionising). Looking at
other RDBMSes, it is clear that there isn't a uniform way of doing this,
as shown in my initial mail. For example, you won't be able to sum MySQL's
cast output (the YYYYMMDD format), because you might get invalid
months/days in the result (October + October = month 20?).

Currently, it isn't possible to sum dates; if this is something that you
want to have in there, I can add it to the PR. For the .summary() function
it isn't required; the workaround is to cast to a double first and then
sum the days since epoch.

Kind regards, Fokko


On Fri, 21 Aug 2020 at 18:13, Greg Rahn <gr...@gmail.com> wrote:

> Fokko-
>
> I can certainly understand the ask for this behavior and it seems simple
> and/or straightforward at face value -- as R's Date Class behaves this way,
> however, I think there are several challenges involved here, as AFAIK,
> there seems to be no well defined way SQL systems do this implicitly today
> -- and I think there is a clear explanation of why.
>
> As you might know, AVG() is the equivalent of SUM()/COUNT() -- so if AVG()
> works on DATE, then SUM() (and the plus operator) would need to work as
> well -- but what is the definition of SUM() for a list of DATEs or DATE +
> DATE? There isn't any such definition in the SQL world. And take a very
> simple example -- what's the AVG() of today and yesterday? If you represent
> these dates as their Julian integer value, the average of two
> consecutive integers is a floating point number between them -- so now
> what?  Should it be ceil() or floor()? Round down or round up? Should it be
> yesterday or today?  It seems very challenging to deal with this in a
> formal and standardized way.
>
> It's my understanding that it's this ambiguity, and the need for an
> intermediate numerical representation like the Julian Day Number, that
> limits this from being an implicit SQL data type conversion operation and
> why users are forced to choose their own explicit conversion (to Julian
> date, for example) in order to do such mathematical operations. As we see
> in the SQL:2016 standard cast specification and conversion table (below) --
> there is no standardized way to implicitly cast between DATE and numeric
> types because there is not just one numerical representation of a DATE.
>
> If we look back at the use case from SPARK-10520 which is about R's
> summary function on a vector of dates -- (and I don't claim to be an R
> internals expert) but maybe it makes more sense to implement R's Date Class
> behavior in SparkR vs more generically across multiple languages and
> types.  If possible, this would seem to yield the desired behavior for the
> use case in question, but not broaden the requirement so wide that it
> becomes too challenging to implement and conflicts with other well defined
> standards, like ANSI SQL:2016.
>
>
> [image: Screen Shot 2020-08-21 at 12.00.01 PM.png]
>
>
> For reference:
> https://github.com/apache/spark/pull/28754
> https://issues.apache.org/jira/browse/SPARK-10520
>
>
>
>
> On Wed, Aug 19, 2020 at 9:34 PM Driesprong, Fokko <fo...@driesprong.frl>
> wrote:
>
>> Hi all,
>>
>> Personally, I'm a big fan of the .summary() function to compute
>> statistics of a dataframe. I often use this for debugging pipelines, and
>> check what the impact of the RDD is after changing code.
>>
>> I've noticed that not all datatypes are in this summary. Currently, there
>> is a list
>> <https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/stat/StatFunctions.scala#L267>
>> of types that allowed to be included in the summary, and I love to extend
>> that list.
>>
>> The first one is the date type. It is important to define this together
>> with the community, and that we get consensus, as this will be part of the
>> public API. Changing this will be costly (or maybe impossible) to do.
>>
>> I've checked what other DBMS'es do with averages out of dates:
>>
>> Postgres
>>
>> Unsupported:
>>
>> postgres@366ecc8a0fb9:/$ psql
>> psql (12.3 (Debian 12.3-1.pgdg100+1))
>> Type "help" for help.
>>
>> postgres=# SELECT CAST(CAST('2020-01-01' AS DATE) AS decimal);
>> ERROR:  cannot cast type date to numeric
>> LINE 1: SELECT CAST(CAST('2020-01-01' AS DATE) AS decimal);
>>                ^
>>
>> postgres=# SELECT CAST(CAST('2020-01-01' AS DATE) AS integer);
>> ERROR:  cannot cast type date to integer
>> LINE 1: SELECT CAST(CAST('2020-01-01' AS DATE) AS integer);
>>                ^
>>
>> The way to get the epoch in days is:
>>
>> postgres=# SELECT EXTRACT(DAYS FROM (now() - '1970-01-01'));
>> date_part
>> -----------
>>     18422
>> (1 row)
>>
>>
>> MySQL
>>
>> Converts to a YYYYMMDD format:
>>
>> mysql> SELECT CAST(CAST('2020-01-01' AS DATE) AS decimal);
>> +---------------------------------------------+
>> | CAST(CAST('2020-01-01' AS DATE) AS decimal) |
>> +---------------------------------------------+
>> |                                    20200101 |
>> +---------------------------------------------+
>> 1 row in set (0.00 sec)
>>
>> However, converting to an int, isn't allowed:
>>
>> mysql> SELECT CAST(CAST('2020-01-01' AS DATE) AS int);
>> ERROR 1064 (42000): You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near 'int)' at line 1
>>
>> mysql> SELECT CAST(CAST('2020-01-01' AS DATE) AS bigint);
>> ERROR 1064 (42000): You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near 'bigint)' at line 1
>>
>> Bigquery
>>
>> Unsupported:
>>
>> [image: image.png]
>>
>> Excel
>>
>> Converts it to the days since epoch. This feels weird, but I can see it,
>> as it is being used as a physical format internally in many data formats.
>>
>> [image: image.png]
>>
>> For me, returning a Date as the output of avg(date) seems like a logical
>> choice. Internally it is handled as dates since epoch, which makes sense as
>> well:
>>
>> *Avro* it is milliseconds since epoch:
>>
>> https://github.com/apache/avro/blob/master/lang/java/avro/src/main/java/org/apache/avro/reflect/DateAsLongEncoding.java
>>
>> *Parquet* it is days since epoch:
>> https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#date
>>
>> *ORC* is based around days since Epoch:
>>
>> https://github.com/apache/orc/blob/master/java/core/src/java/org/threeten/extra/chrono/HybridDate.java
>>
>> Also with this, we keep parity with the Catalyst type :)
>>
>> Any further thoughts on this before moving forward?
>>
>> Kind regards, Fokko
>>
>>
>>

Re: Allow average out of a Date

Posted by Greg Rahn <gr...@gmail.com>.
Fokko-

I can certainly understand the ask for this behavior, and it seems simple
and/or straightforward at face value -- R's Date class behaves this way.
However, I think there are several challenges involved here: AFAIK, there
seems to be no well-defined way SQL systems do this implicitly today -- and
I think there is a clear explanation of why.

As you might know, AVG() is the equivalent of SUM()/COUNT() -- so if AVG()
works on DATE, then SUM() (and the plus operator) would need to work as
well -- but what is the definition of SUM() for a list of DATEs or DATE +
DATE? There isn't any such definition in the SQL world. And take a very
simple example -- what's the AVG() of today and yesterday? If you represent
these dates as their Julian integer value, the average of two
consecutive integers is a floating point number between them -- so now
what?  Should it be ceil() or floor()? Round down or round up? Should it be
yesterday or today?  It seems very challenging to deal with this in a
formal and standardized way.

It's my understanding that it's this ambiguity, and the need for an
intermediate numerical representation like the Julian Day Number, that
limits this from being an implicit SQL data type conversion operation and
why users are forced to choose their own explicit conversion (to Julian
date, for example) in order to do such mathematical operations. As we see
in the SQL:2016 standard cast specification and conversion table (below) --
there is no standardized way to implicitly cast between DATE and numeric
types because there is not just one numerical representation of a DATE.

If we look back at the use case from SPARK-10520, which is about R's summary
function on a vector of dates -- and I don't claim to be an R internals
expert -- maybe it makes more sense to implement R's Date class behavior
in SparkR rather than more generically across multiple languages and types.  If
possible, this would seem to yield the desired behavior for the use case in
question, but not broaden the requirement so wide that it becomes too
challenging to implement and conflicts with other well-defined standards,
like ANSI SQL:2016.


[image: Screen Shot 2020-08-21 at 12.00.01 PM.png]


For reference:
https://github.com/apache/spark/pull/28754
https://issues.apache.org/jira/browse/SPARK-10520




On Wed, Aug 19, 2020 at 9:34 PM Driesprong, Fokko <fo...@driesprong.frl>
wrote:

> Hi all,
>
> Personally, I'm a big fan of the .summary() function to compute statistics
> of a dataframe. I often use this for debugging pipelines, and check what
> the impact of the RDD is after changing code.
>
> I've noticed that not all datatypes are in this summary. Currently, there
> is a list
> <https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/stat/StatFunctions.scala#L267>
> of types that allowed to be included in the summary, and I love to extend
> that list.
>
> The first one is the date type. It is important to define this together
> with the community, and that we get consensus, as this will be part of the
> public API. Changing this will be costly (or maybe impossible) to do.
>
> I've checked what other DBMS'es do with averages out of dates:
>
> Postgres
>
> Unsupported:
>
> postgres@366ecc8a0fb9:/$ psql
> psql (12.3 (Debian 12.3-1.pgdg100+1))
> Type "help" for help.
>
> postgres=# SELECT CAST(CAST('2020-01-01' AS DATE) AS decimal);
> ERROR:  cannot cast type date to numeric
> LINE 1: SELECT CAST(CAST('2020-01-01' AS DATE) AS decimal);
>                ^
>
> postgres=# SELECT CAST(CAST('2020-01-01' AS DATE) AS integer);
> ERROR:  cannot cast type date to integer
> LINE 1: SELECT CAST(CAST('2020-01-01' AS DATE) AS integer);
>                ^
>
> The way to get the epoch in days is:
>
> postgres=# SELECT EXTRACT(DAYS FROM (now() - '1970-01-01'));
> date_part
> -----------
>     18422
> (1 row)
>
>
> MySQL
>
> Converts to a YYYYMMDD format:
>
> mysql> SELECT CAST(CAST('2020-01-01' AS DATE) AS decimal);
> +---------------------------------------------+
> | CAST(CAST('2020-01-01' AS DATE) AS decimal) |
> +---------------------------------------------+
> |                                    20200101 |
> +---------------------------------------------+
> 1 row in set (0.00 sec)
>
> However, converting to an int, isn't allowed:
>
> mysql> SELECT CAST(CAST('2020-01-01' AS DATE) AS int);
> ERROR 1064 (42000): You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near 'int)' at line 1
>
> mysql> SELECT CAST(CAST('2020-01-01' AS DATE) AS bigint);
> ERROR 1064 (42000): You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near 'bigint)' at line 1
>
> Bigquery
>
> Unsupported:
>
> [image: image.png]
>
> Excel
>
> Converts it to the days since epoch. This feels weird, but I can see it,
> as it is being used as a physical format internally in many data formats.
>
> [image: image.png]
>
> For me, returning a Date as the output of avg(date) seems like a logical
> choice. Internally it is handled as dates since epoch, which makes sense as
> well:
>
> *Avro* it is milliseconds since epoch:
>
> https://github.com/apache/avro/blob/master/lang/java/avro/src/main/java/org/apache/avro/reflect/DateAsLongEncoding.java
>
> *Parquet* it is days since epoch:
> https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#date
>
> *ORC* is based around days since Epoch:
>
> https://github.com/apache/orc/blob/master/java/core/src/java/org/threeten/extra/chrono/HybridDate.java
>
> Also with this, we keep parity with the Catalyst type :)
>
> Any further thoughts on this before moving forward?
>
> Kind regards, Fokko
>
>
>