Posted to issues@spark.apache.org by "Wenchen Fan (Jira)" <ji...@apache.org> on 2020/03/19 06:49:00 UTC

[jira] [Assigned] (SPARK-31159) Incompatible Parquet dates/timestamps with Spark 2.4

     [ https://issues.apache.org/jira/browse/SPARK-31159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wenchen Fan reassigned SPARK-31159:
-----------------------------------

    Assignee: Maxim Gekk

> Incompatible Parquet dates/timestamps with Spark 2.4
> ----------------------------------------------------
>
>                 Key: SPARK-31159
>                 URL: https://issues.apache.org/jira/browse/SPARK-31159
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>    Affects Versions: 3.0.0
>            Reporter: Maxim Gekk
>            Assignee: Maxim Gekk
>            Priority: Major
>
> Write dates/timestamps to Parquet file in Spark 2.4:
> {code}
> $ export TZ="UTC"
> $ ~/spark-2.4/bin/spark-shell
> Welcome to
>       ____              __
>      / __/__  ___ _____/ /__
>     _\ \/ _ \/ _ `/ __/  '_/
>    /___/ .__/\_,_/_/ /_/\_\   version 2.4.5
>       /_/
> Using Scala version 2.11.12 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_231)
> Type in expressions to have them evaluated.
> Type :help for more information.
> scala> spark.conf.set("spark.sql.session.timeZone", "UTC")
> scala> val df = Seq(("1001-01-01", "1001-01-01 01:02:03.123456")).toDF("dateS", "tsS").select($"dateS".cast("date").as("d"), $"tsS".cast("timestamp").as("ts"))
> df: org.apache.spark.sql.DataFrame = [d: date, ts: timestamp]
> scala> spark.conf.set("spark.sql.parquet.outputTimestampType", "TIMESTAMP_MICROS")
> scala> df.write.parquet("/Users/maxim/tmp/before_1582/2_4_5_micros")
> scala> spark.read.parquet("/Users/maxim/tmp/before_1582/2_4_5_micros").show(false)
> +----------+--------------------------+
> |d         |ts                        |
> +----------+--------------------------+
> |1001-01-01|1001-01-01 01:02:03.123456|
> +----------+--------------------------+
> {code}
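> Spark 2.4 saves dates/timestamps before 1582-10-15 in the Julian calendar. The parquet-mr tool, which interprets the stored values in the Proleptic Gregorian calendar, prints *1001-01-07* and *1001-01-07T01:02:03.123456+0000* instead of the values that were written: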
> {code}
> $ java -jar /Users/maxim/proj/parquet-mr/parquet-tools/target/parquet-tools-1.12.0-SNAPSHOT.jar dump -m ./2_4_5_micros/part-00000-fe310bfa-0f61-44af-85ee-489721042c14-c000.snappy.parquet
> INT32 d
> --------------------------------------------------------------------------------
> *** row group 1 of 1, values 1 to 1 ***
> value 1: R:0 D:1 V:1001-01-07
> INT64 ts
> --------------------------------------------------------------------------------
> *** row group 1 of 1, values 1 to 1 ***
> value 1: R:0 D:1 V:1001-01-07T01:02:03.123456+0000
> {code}
> Spark 3.0.0-preview2 (and 3.1.0-SNAPSHOT) prints the same values as parquet-mr, which differ from the values Spark 2.4 wrote and reads back:
> {code}
> Welcome to
>       ____              __
>      / __/__  ___ _____/ /__
>     _\ \/ _ \/ _ `/ __/  '_/
>    /___/ .__/\_,_/_/ /_/\_\   version 3.0.0-preview2
>       /_/
> Using Scala version 2.12.10 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_231)
> scala> spark.read.parquet("/Users/maxim/tmp/before_1582/2_4_5_micros").show(false)
> +----------+--------------------------+
> |d         |ts                        |
> +----------+--------------------------+
> |1001-01-07|1001-01-07 01:02:03.123456|
> +----------+--------------------------+
> {code}
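
The six-day shift between the two outputs can be reproduced without Spark, using only JDK classes. The sketch below assumes that Spark 2.4 goes through the hybrid Julian/Gregorian calendar of java.util.GregorianCalendar (the calendar behind java.sql.Date and java.sql.Timestamp) and that Spark 3.0 and parquet-mr interpret the stored day count in the Proleptic Gregorian calendar of java.time. The names CalendarShift and hybridEpochDay are hypothetical and exist only for this illustration; this is not Spark's actual code path.

```scala
import java.time.LocalDate
import java.util.{GregorianCalendar, TimeZone}

object CalendarShift {
  // Days since 1970-01-01 for a local date interpreted in the hybrid
  // Julian/Gregorian calendar (Julian before the 1582-10-15 cutover).
  // Hypothetical helper, not part of Spark.
  def hybridEpochDay(year: Int, month: Int, day: Int): Long = {
    val cal = new GregorianCalendar(TimeZone.getTimeZone("UTC"))
    cal.clear()                    // reset all fields, including time of day
    cal.set(year, month - 1, day)  // Calendar months are zero-based
    Math.floorDiv(cal.getTimeInMillis, 86400000L)
  }

  def main(args: Array[String]): Unit = {
    // Day count that a hybrid-calendar writer would store for 1001-01-01.
    val storedDays = hybridEpochDay(1001, 1, 1)

    // The same day count read back in the Proleptic Gregorian calendar.
    val readBack = LocalDate.ofEpochDay(storedDays)  // 1001-01-07

    // Day count of 1001-01-01 in the Proleptic Gregorian calendar itself.
    val prolepticDays = LocalDate.of(1001, 1, 1).toEpochDay

    println(s"stored day count read as Proleptic Gregorian: $readBack")
    println(s"shift in days: ${storedDays - prolepticDays}")  // 6
  }
}
```

The object can be pasted into spark-shell and run with CalendarShift.main(Array()). For dates on or after 1582-10-15 both calendars agree, so the shift is zero and the incompatibility only affects older values.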


