You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@spark.apache.org by "Will Zimmerman (Jira)" <ji...@apache.org> on 2021/04/02 21:27:00 UTC

[jira] [Commented] (SPARK-27450) Timestamp cast fails when the ISO8601 string omits minutes, seconds or milliseconds

    [ https://issues.apache.org/jira/browse/SPARK-27450?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17314093#comment-17314093 ] 

Will Zimmerman commented on SPARK-27450:
----------------------------------------

[~maxgekk] the proposed solution allows for handling of a specific use case, but it doesn't allow a higher level of time specificity to make it through. 

 

The issue is that a valid [ISO 8601|https://en.wikipedia.org/wiki/ISO_8601#Times] format isn't able to be interpreted by the built-in [timestamp/to_timestamp|https://spark.apache.org/docs/latest/api/sql/index.html#to_timestamp] Spark function. As can be seen in the example below, the fact that time zone designator is present causes the transformation to become Null.
{code:java}
spark.sql(F"""
SELECT
    TO_TIMESTAMP('2020-05-19 10:30Z') AS timestamp_1
   ,TO_TIMESTAMP('2020-05-19 10:30') AS timestamp_2
""").show(truncate=False)

# Result
+-----------+-------------------+
|timestamp_1|timestamp_2        |
+-----------+-------------------+
|null       |2020-05-19 10:30:00|
+-----------+-------------------+{code}
The proposed workaround would take the specified formatting as you described above, but it would only be applicable to that specific format and not the other valid ISO 8601 formats.

 
{code:java}
spark.sql(F"""
SELECT
    TO_TIMESTAMP('2020-05-19 10:30Z', 'yyyy-MM-dd HH:mmX') AS timestamp_1
   ,TO_TIMESTAMP('2020-05-19 10:30:19Z', 'yyyy-MM-dd HH:mmX') AS timestamp_2
   ,TO_TIMESTAMP('2020-05-19 10:30:19.001Z', 'yyyy-MM-dd HH:mmX') AS timestamp_3
""").show(truncate=False)

# Result
+-------------------+-----------+-----------+
|timestamp_1        |timestamp_2|timestamp_3|
+-------------------+-----------+-----------+
|2020-05-19 10:30:00|null       |null       |
+-------------------+-----------+-----------+
{code}
The other option I tried below ended up cutting off any additional time granularity past the specified format. This behavior makes sense, but it is problematic for data that has multiple degrees of time granularity.

 

 
{code:java}
spark.sql(F"""
SELECT
    TO_TIMESTAMP('2020-05-19 10:30Z', 'yyyy-MM-dd HH:mm') AS timestamp_1
   ,TO_TIMESTAMP('2020-05-19 10:30:19Z', 'yyyy-MM-dd HH:mm') AS timestamp_2
   ,TO_TIMESTAMP('2020-05-19 10:30:19.001Z', 'yyyy-MM-dd HH:mm') AS timestamp_3
""").show(truncate=False)

# Result
+-------------------+-------------------+-------------------+
|timestamp_1        |timestamp_2        |timestamp_3        |
+-------------------+-------------------+-------------------+
|2020-05-19 10:30:00|2020-05-19 10:30:00|2020-05-19 10:30:00|
+-------------------+-------------------+-------------------+
{code}
The only workaround I've found that would work in any of the ISO 8601 formats is used in the following.

 

 
{code:java}
spark.sql(F"""
SELECT
    TO_TIMESTAMP(REPLACE('2020-05-19 10:30Z', 'Z', '')) AS timestamp_1
   ,TO_TIMESTAMP(REPLACE('2020-05-19 10:30:19Z', 'Z', '')) AS timestamp_2
   ,TO_TIMESTAMP(REPLACE('2020-05-19 10:30:19.001Z', 'Z', '')) AS timestamp_3
""").show(truncate=False)

# Result
+-------------------+-------------------+-----------------------+
|timestamp_1        |timestamp_2        |timestamp_3            |
+-------------------+-------------------+-----------------------+
|2020-05-19 10:30:00|2020-05-19 10:30:19|2020-05-19 10:30:19.001|
+-------------------+-------------------+-----------------------+
{code}
Is it possible to be able to have the built-in Spark function to have the "yyyy-MM-dd HH:mmX" format added to the existing accepted formats?

 

> Timestamp cast fails when the ISO8601 string omits minutes, seconds or milliseconds
> -----------------------------------------------------------------------------------
>
>                 Key: SPARK-27450
>                 URL: https://issues.apache.org/jira/browse/SPARK-27450
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.3.0
>         Environment: Spark 2.3.x
>            Reporter: Leandro Rosa
>            Priority: Major
>
> ISO8601 allows to omit minutes, seconds and milliseconds.
> {quote}
> |hh:mm:ss.sss|_or_|hhmmss.sss|
> |hh:mm:ss|_or_|hhmmss|
> |hh:mm|_or_|hhmm|
> | |hh|
> {quote}
> {quote}Either the seconds, or the minutes and seconds, may be omitted from the basic or extended time formats for greater brevity but decreased accuracy: [hh]:[mm], [hh][mm] and [hh] are the resulting reduced accuracy time formats
> {quote}
> Source: [Wikipedia ISO8601|https://en.wikipedia.org/wiki/ISO_8601]
> Popular libs, such as [ZonedDateTime|https://docs.oracle.com/javase/8/docs/api/java/time/ZonedDateTime.html], respect that. However, Timestamp cast fails silently.
>  
> {code:java}
> import org.apache.spark.sql.types._
> val df1 = Seq(("2017-08-01T02:33")).toDF("eventTimeString") // NON-ISO8601 (missing TZ offset) [OK]
> val new_df1 = df1
> .withColumn("eventTimeTS", col("eventTimeString").cast(TimestampType))
> new_df1.show(false)
> +----------------+-------------------+
> |eventTimeString |eventTimeTS |
> +----------------+-------------------+
> |2017-08-01T02:33|2017-08-01 02:33:00|
> +----------------+-------------------+
> {code}
> {code:java}
> val df2 = Seq(("2017-08-01T02:33Z")).toDF("eventTimeString") // ISO8601 [FAIL]
> val new_df2 = df2
> .withColumn("eventTimeTS", col("eventTimeString").cast(TimestampType))
> new_df2.show(false)
> +-----------------+-----------+
> |eventTimeString |eventTimeTS|
> +-----------------+-----------+
> |2017-08-01T02:33Z|null |
> +-----------------+-----------+
> {code}
>  
> {code:java}
> val df3 = Seq(("2017-08-01T02:33-03:00")).toDF("eventTimeString") // ISO8601 [FAIL]
> val new_df3 = df3
> .withColumn("eventTimeTS", col("eventTimeString").cast(TimestampType))
> new_df3.show(false)
> +----------------------+-----------+
> |eventTimeString |eventTimeTS|
> +----------------------+-----------+
> |2017-08-01T02:33-03:00|null |
> +----------------------+-----------+
> {code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org