Posted to issues@spark.apache.org by "Yuming Wang (JIRA)" <ji...@apache.org> on 2017/09/24 01:36:01 UTC

[jira] [Commented] (SPARK-22108) Logical Inconsistency in Timestamp Cast

    [ https://issues.apache.org/jira/browse/SPARK-22108?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16178039#comment-16178039 ] 

Yuming Wang commented on SPARK-22108:
-------------------------------------

https://issues.apache.org/jira/browse/SPARK-22070?focusedCommentId=16174212&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16174212

> Logical Inconsistency in Timestamp Cast
> ---------------------------------------
>
>                 Key: SPARK-22108
>                 URL: https://issues.apache.org/jira/browse/SPARK-22108
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.2.0
>            Reporter: Imran Rashid
>
> When Spark compares a Timestamp and a String, the cast depends on the type of operation. Equality casts the string to a timestamp; other comparisons, e.g. <= or >=, cast the timestamp to a String. This can lead to logical inconsistencies, where {{a == b}} but not {{a <= b}}.
> The difference appears to be intentional, from SPARK-8420: https://github.com/apache/spark/commit/a333a72e029d2546a66b36d6b3458e965430c530
> {noformat}
> ... we casted `TimestampType` and `DateType` to `StringType` when it was involved
> in a binary comparison with a `StringType`.  This allowed comparing a timestamp with
>  a partial date as a user would expect.
>  - `time > "2014-06-10"`
>  - `time > "2014"`
> {noformat}
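
Under plain lexicographic comparison that motivation does hold, because the string form of a timestamp starts with the full date, so a partial date or bare year compares as a prefix. A minimal plain-Scala check (illustrative only, not taken from the issue or the commit):

{code}
// The string rendering of a timestamp is "yyyy-MM-dd HH:mm:ss...", so a partial
// date or a bare year is compared character by character as a prefix.
"2014-06-11 00:00:00".compareTo("2014-06-10") > 0   // true: '1' > '0' at the first differing character
"2014-06-11 00:00:00".compareTo("2014") > 0         // true: "2014" is a strict prefix
{code}
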
> However, this motivation seems flawed; casting to Strings also leads to other unnatural comparisons, e.g. "2015" < "5" by string comparison -- this extra test case in [{{DateFunctionsSuite."timestamp comparison with date strings"}}|https://github.com/apache/spark/blob/3e6a714c9ee97ef13b3f2010babded3b63fd9d74/sql/core/src/test/scala/org/apache/spark/sql/DateFunctionsSuite.scala#L76] would fail:
> {code}
>     checkAnswer(
>       df.select("t").filter($"t" >= "5"),
>       Row(Date.valueOf("2015-01-01")) :: Nil)
> {code}
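
Once the timestamp is rendered as a string, that comparison is decided by the first differing character, and '2' sorts below '5'. A minimal plain-Scala check of the claim (illustrative only, not part of the suite):

{code}
// Lexicographic string comparison: '2' < '5' decides both results below.
"2015".compareTo("5") < 0                  // true, i.e. "2015" < "5"
"2015-01-01 00:00:00".compareTo("5") >= 0  // false, so the filter above would match no rows
{code}
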
> Here's an example showing the logical inconsistency:
> {noformat}
> scala> spark.sql("""select
> cast('2015-01-01' as timestamp) >= '2015',
> cast('2015-01-01' as timestamp) <= '2015',
> cast('2015-01-01' as timestamp) = '2015'""").show()
> +-------------------------------------------------------
> +-------------------------------------------------------
> +---------------------------------------------------------+
> |(CAST(CAST(2015-01-01 AS TIMESTAMP) AS STRING) >= 2015)
> |(CAST(CAST(2015-01-01 AS TIMESTAMP) AS STRING) <= 2015)
> |(CAST(2015-01-01 AS TIMESTAMP) = CAST(2015 AS TIMESTAMP))|
> +-------------------------------------------------------
> +-------------------------------------------------------
> +---------------------------------------------------------+
> |                                                   true
> |                                                  false
> |                                                     true|
> +-------------------------------------------------------
> +-------------------------------------------------------
> +---------------------------------------------------------+
> {noformat}
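
For what it's worth, making the cast explicit on the string side keeps all three comparisons in timestamp space and removes the inconsistency. A sketch (not a captured transcript); per the equality row above, '2015' casts to the same timestamp as '2015-01-01', so all three columns should come back true:

{code}
// With explicit casts both operands are timestamps, so >=, <= and = all use
// the same comparison semantics instead of depending on the operator.
spark.sql("""select
  cast('2015-01-01' as timestamp) >= cast('2015' as timestamp),
  cast('2015-01-01' as timestamp) <= cast('2015' as timestamp),
  cast('2015-01-01' as timestamp) =  cast('2015' as timestamp)""").show()
{code}
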
> FWIW, I looked at the SQL spec and other engines. As near as I can tell, the SQL spec does not specify what implicit casts should be applied (but I could be wrong). Hive always returns null for these comparisons. Postgres is stricter in a couple of ways: for literal inputs, it doesn't accept anything that is just a year, e.g. "2015", and when joining two different tables, it forces the user to explicitly specify the cast.
> {noformat}
> psql (9.6.2)
> Type "help" for help.
> irashid=# select
> irashid-# cast('2015-01-01' as timestamp) >= '2015',
> irashid-# cast('2015-01-01' as timestamp) <= '2015',
> irashid-# cast('2015-01-01' as timestamp) = '2015'
> irashid-# ;
> ERROR:  invalid input syntax for type timestamp: "2015"
> LINE 2: cast('2015-01-01' as timestamp) >= '2015',
>                                            ^
> irashid=# select
> irashid-# cast('2015-01-01' as timestamp) >= '2015-01-01',
> irashid-# cast('2015-01-01' as timestamp) <= '2015-01-01',
> irashid-# cast('2015-01-01' as timestamp) = '2015-01-01'
> irashid-# ;
>  ?column? | ?column? | ?column? 
> ----------+----------+----------
>  t        | t        | t
> (1 row)
> irashid=# create table cast_test_ts (x varchar, ts timestamp);
> CREATE TABLE
> irashid=# insert into cast_test_ts (x, ts) values ('2015-01-01', '2015-01-01');
> INSERT 0 1
> irashid=# select * from cast_test_ts;
>      x      |         ts          
> ------------+---------------------
>  2015-01-01 | 2015-01-01 00:00:00
> (1 row)
> irashid=# select a.x, b.ts from cast_test_ts a join cast_test_ts b on (a.x = b.ts);
> ERROR:  operator does not exist: character varying = timestamp without time zone
> LINE 1: ....ts from cast_test_ts a join cast_test_ts b on (a.x = b.ts);
>                                                                ^
> HINT:  No operator matches the given name and argument type(s). You might need to add explicit type casts.
> irashid=# 
> irashid=# select a.x, b.ts from cast_test_ts a join cast_test_ts b on (cast(a.x as timestamp) = b.ts);
>      x      |         ts          
> ------------+---------------------
>  2015-01-01 | 2015-01-01 00:00:00
> (1 row)
> irashid=# select a.x, b.ts from cast_test_ts a join cast_test_ts b on (a.x = cast(b.ts as varchar));
>  x | ts 
> ---+----
> (0 rows)
> {noformat}
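
Spark, by contrast, inserts these casts silently. One way to see which side gets cast in the column-to-column case is to look at the analyzed plan; a hypothetical spark-shell sketch (table and column names invented to mirror the Postgres example above):

{code}
import java.sql.Timestamp
import spark.implicits._   // already in scope in spark-shell

// Hypothetical data shaped like the Postgres cast_test_ts table above.
val df = Seq(("2015-01-01", Timestamp.valueOf("2015-01-01 00:00:00"))).toDF("x", "ts")

// explain(true) prints the analyzed plan, including any implicit cast the
// analyzer inserted: per the description above, equality casts the string side
// to timestamp, while <= casts the timestamp side to string.
df.filter($"x" === $"ts").explain(true)
df.filter($"x" <= $"ts").explain(true)
{code}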


