Posted to issues@spark.apache.org by "Maxim Gekk (Jira)" <ji...@apache.org> on 2020/02/09 21:09:00 UTC

[jira] [Comment Edited] (SPARK-30767) from_json changes times of timestamps by several minutes without error

    [ https://issues.apache.org/jira/browse/SPARK-30767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17033293#comment-17033293 ] 

Maxim Gekk edited comment on SPARK-30767 at 2/9/20 9:08 PM:
------------------------------------------------------------

The default timestamp pattern in the JSON datasource specifies only milliseconds, but your input strings carry timestamps with microsecond precision. You can change the pattern via:
{code:scala}
from_json(col("json"), struct, Map("timestampFormat" -> "uuuu-MM-dd'T'HH:mm:ss.SSSSSSXXX"))
{code}
Just in case: this should work in the Spark 3.0 preview and in Spark 2.4.5.
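
For a self-contained run, here is a minimal sketch of the suggestion above (it reuses the input string and schema from the reproduction quoted below, and assumes {{spark.implicits._}} is in scope for {{toDF}}; the {{fixed}} name is just for illustration):
{code:scala}
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
import spark.implicits._

val df = Seq("""{"time":"2020-01-25T06:39:45.887429Z"}""").toDF("json")
val struct = new StructType().add("time", TimestampType, nullable = true)

// Six fractional digits (SSSSSS) so the microseconds are parsed as such
// instead of overflowing the millisecond field:
val fixed = df.withColumn("time",
  from_json(col("json"), struct, Map("timestampFormat" -> "uuuu-MM-dd'T'HH:mm:ss.SSSSSSXXX")))
fixed.show(false)
{code}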


> from_json changes times of timestamps by several minutes without error
> -----------------------------------------------------------------------
>
>                 Key: SPARK-30767
>                 URL: https://issues.apache.org/jira/browse/SPARK-30767
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.4.4
>         Environment: We ran the example code with Spark 2.4.4 on Azure Databricks (Databricks Runtime 6.3) in an interactive cluster. We first encountered the issue on a job cluster running a streaming application on Databricks Runtime 5.4.
>            Reporter: Benedikt Maria Beckermann
>            Priority: Major
>              Labels: corruption
>
> When a JSON text column includes a timestamp formatted like {{2020-01-25T06:39:45.887429Z}}, the function {{from_json(Column, StructType)}} infers a timestamp, but that timestamp is shifted by several minutes.
> Spark does not throw any kind of error; it silently continues with the corrupted timestamp.
> The following Scala snippet reproduces the issue.
>  
> {code:scala}
> import org.apache.spark.sql._
> import org.apache.spark.sql.functions._
> import org.apache.spark.sql.types._
> import spark.implicits._  // required for toDF outside a notebook
>
> val df = Seq("""{"time":"2020-01-25T06:39:45.887429Z"}""").toDF("json")
> val struct = new StructType().add("time", TimestampType, nullable = true)
>
> // The direct cast parses the microseconds correctly; parsing the same
> // string through from_json with the default timestampFormat does not.
> val timeDF = df
>   .withColumn("time (string)", get_json_object(col("json"), "$.time"))
>   .withColumn("time casted directly (CORRECT)", col("time (string)").cast(TimestampType))
>   .withColumn("time casted via struct (INVALID)", from_json(col("json"), struct))
>
> display(timeDF)  // Databricks-specific; use timeDF.show(false) in plain Spark
> {code}
> Output: 
> ||json||time (string)||time casted directly (CORRECT)||time casted via struct (INVALID)||
> |{"time":"2020-01-25T06:39:45.887429Z"}|2020-01-25T06:39:45.887429Z|2020-01-25T06:39:45.887+0000|{"time":"2020-01-25T06:54:32.429+0000"}|


