Posted to issues@spark.apache.org by "Wenchen Fan (Jira)" <ji...@apache.org> on 2020/03/19 04:53:00 UTC
[jira] [Updated] (SPARK-31159) Incompatible Parquet dates/timestamps with Spark 2.4
[ https://issues.apache.org/jira/browse/SPARK-31159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Wenchen Fan updated SPARK-31159:
--------------------------------
Parent: SPARK-30951
Issue Type: Sub-task (was: Bug)
> Incompatible Parquet dates/timestamps with Spark 2.4
> ----------------------------------------------------
>
> Key: SPARK-31159
> URL: https://issues.apache.org/jira/browse/SPARK-31159
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 3.0.0
> Reporter: Maxim Gekk
> Priority: Major
>
> Write dates/timestamps to Parquet file in Spark 2.4:
> {code}
> $ export TZ="UTC"
> $ ~/spark-2.4/bin/spark-shell
> Welcome to
>       ____              __
>      / __/__  ___ _____/ /__
>     _\ \/ _ \/ _ `/ __/  '_/
>    /___/ .__/\_,_/_/ /_/\_\   version 2.4.5
>       /_/
> Using Scala version 2.11.12 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_231)
> Type in expressions to have them evaluated.
> Type :help for more information.
> scala> spark.conf.set("spark.sql.session.timeZone", "UTC")
> scala> val df = Seq(("1001-01-01", "1001-01-01 01:02:03.123456")).toDF("dateS", "tsS").select($"dateS".cast("date").as("d"), $"tsS".cast("timestamp").as("ts"))
> df: org.apache.spark.sql.DataFrame = [d: date, ts: timestamp]
> scala> df.write.parquet("/Users/maxim/tmp/before_1582/2_4_5_micros")
> scala> spark.conf.set("spark.sql.parquet.outputTimestampType", "TIMESTAMP_MICROS")
> scala> df.write.parquet("/Users/maxim/tmp/before_1582/2_4_5_micros")
> scala> spark.read.parquet("/Users/maxim/tmp/before_1582/2_4_5_micros").show(false)
> +----------+--------------------------+
> |d         |ts                        |
> +----------+--------------------------+
> |1001-01-01|1001-01-01 01:02:03.123456|
> +----------+--------------------------+
> {code}
> Spark 2.4 saves dates/timestamps before 1582-10-15 in the Julian calendar. The parquet-mr tool, which reads the stored values in the proleptic Gregorian calendar, prints *1001-01-07* and *1001-01-07T01:02:03.123456+0000*:
> {code}
> $ java -jar /Users/maxim/proj/parquet-mr/parquet-tools/target/parquet-tools-1.12.0-SNAPSHOT.jar dump -m ./2_4_5_micros/part-00000-fe310bfa-0f61-44af-85ee-489721042c14-c000.snappy.parquet
> INT32 d
> --------------------------------------------------------------------------------
> *** row group 1 of 1, values 1 to 1 ***
> value 1: R:0 D:1 V:1001-01-07
> INT64 ts
> --------------------------------------------------------------------------------
> *** row group 1 of 1, values 1 to 1 ***
> value 1: R:0 D:1 V:1001-01-07T01:02:03.123456+0000
> {code}
> Spark 3.0.0-preview2 (and 3.1.0-SNAPSHOT) prints the same values as parquet-mr, which differ from the values written by Spark 2.4:
> {code}
> Welcome to
>       ____              __
>      / __/__  ___ _____/ /__
>     _\ \/ _ \/ _ `/ __/  '_/
>    /___/ .__/\_,_/_/ /_/\_\   version 3.0.0-preview2
>       /_/
> Using Scala version 2.12.10 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_231)
> scala> spark.read.parquet("/Users/maxim/tmp/before_1582/2_4_5_micros").show(false)
> +----------+--------------------------+
> |d         |ts                        |
> +----------+--------------------------+
> |1001-01-07|1001-01-07 01:02:03.123456|
> +----------+--------------------------+
> {code}
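
The 6-day difference reported above can be reproduced without Spark or Parquet. The following is a minimal sketch, not Spark's actual code path: it assumes that the hybrid Julian + Gregorian calendar of java.util.GregorianCalendar stands in for the Spark 2.4 writer and that java.time.LocalDate (proleptic Gregorian) stands in for the parquet-mr and Spark 3.0 readers. The object name CalendarShift and its two helpers are hypothetical and exist only for this illustration.

```scala
import java.time.LocalDate
import java.util.{GregorianCalendar, TimeZone}

object CalendarShift {
  private val Utc = TimeZone.getTimeZone("UTC")
  private val MillisPerDay = 86400000L

  // Days since 1970-01-01 for a local date interpreted in the hybrid
  // Julian + Gregorian calendar (the default of java.util.GregorianCalendar,
  // with the cutover at 1582-10-15).
  def hybridEpochDay(year: Int, month: Int, day: Int): Long = {
    val cal = new GregorianCalendar(Utc)
    cal.clear()                    // reset all fields, so the time is midnight UTC
    cal.set(year, month - 1, day)  // Calendar months are 0-based
    Math.floorDiv(cal.getTimeInMillis, MillisPerDay)
  }

  // The same day count read back in the proleptic Gregorian calendar.
  def asProleptic(epochDay: Long): LocalDate = LocalDate.ofEpochDay(epochDay)

  def main(args: Array[String]): Unit = {
    val days = hybridEpochDay(1001, 1, 1)
    println(asProleptic(days))                            // 1001-01-07
    println(asProleptic(hybridEpochDay(2020, 3, 19)))     // 2020-03-19
  }
}
```

For dates after the 1582 cutover the two calendars agree, so only old dates such as the 1001-01-01 value in this report are shifted when the stored day count is reinterpreted.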