Posted to issues@spark.apache.org by "Yin Huai (JIRA)" <ji...@apache.org> on 2015/09/09 20:01:46 UTC
[jira] [Created] (SPARK-10519) Investigate if we should encode
timezone information to a timestamp value stored in JSON
Yin Huai created SPARK-10519:
--------------------------------
Summary: Investigate if we should encode timezone information to a timestamp value stored in JSON
Key: SPARK-10519
URL: https://issues.apache.org/jira/browse/SPARK-10519
Project: Spark
Issue Type: Task
Components: SQL
Reporter: Yin Huai
Priority: Minor
Since Spark 1.3, we have stored timestamps in JSON without encoding timezone information, and the string representation of a timestamp stored in JSON implicitly uses the local timezone (see [1|https://github.com/apache/spark/blob/branch-1.3/sql/core/src/main/scala/org/apache/spark/sql/json/JsonRDD.scala#L454], [2|https://github.com/apache/spark/blob/branch-1.4/sql/core/src/main/scala/org/apache/spark/sql/json/JacksonGenerator.scala#L38], [3|https://github.com/apache/spark/blob/branch-1.5/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/json/JacksonGenerator.scala#L41], [4|https://github.com/apache/spark/blob/branch-1.5/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/json/JacksonGenerator.scala#L93]). As a result, data consumers may get different values when they are in a different timezone from the data producers.
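To make the problem concrete, here is a minimal sketch in plain Java. It does not use Spark's JSON writer; the class name {{LocalZoneRoundTrip}}, the helper names, and the choice of Asia/Tokyo and UTC as producer and consumer zones are assumptions made for illustration only.

```java
import java.time.Duration;
import java.time.Instant;
import java.time.LocalDateTime;
import java.time.ZoneId;
import java.time.format.DateTimeFormatter;

// Hypothetical stand-in for a writer/reader pair that drops timezone info.
class LocalZoneRoundTrip {
    private static final DateTimeFormatter FMT =
        DateTimeFormatter.ofPattern("uuuu-MM-dd HH:mm:ss");

    // Producer side: render the instant in the producer's zone, with no offset.
    static String write(Instant instant, String zone) {
        return LocalDateTime.ofInstant(instant, ZoneId.of(zone)).format(FMT);
    }

    // Consumer side: interpret the zone-less string in the consumer's zone.
    static Instant read(String text, String zone) {
        return LocalDateTime.parse(text, FMT).atZone(ZoneId.of(zone)).toInstant();
    }

    public static void main(String[] args) {
        Instant original = Instant.parse("2015-09-09T20:01:46Z");
        String written = write(original, "Asia/Tokyo");
        Instant readBack = read(written, "UTC");
        System.out.println(written);                              // 2015-09-10 05:01:46
        System.out.println(Duration.between(original, readBack)); // PT9H
    }
}
```

The consumer ends up with an instant that is off by the difference between the two zone offsets, which is the mismatch described above.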
Since JSON is string based, if we encode timezone information in the timestamp value, downstream applications may need to change their code (for example, java.sql.Timestamp.valueOf only supports the format {{yyyy-\[m]m-\[d]d hh:mm:ss\[.f...]}}).
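The compatibility concern can be checked with a small sketch. The two timezone-bearing strings below are only hypothetical examples of what an encoded value might look like, not a proposed format, and the class and helper names are made up for this illustration.

```java
import java.sql.Timestamp;

class ValueOfCompatibility {
    // Returns true if java.sql.Timestamp.valueOf can parse the given string.
    static boolean acceptedByValueOf(String text) {
        try {
            Timestamp.valueOf(text);
            return true;
        } catch (IllegalArgumentException e) {
            // valueOf rejects anything outside yyyy-[m]m-[d]d hh:mm:ss[.f...]
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(acceptedByValueOf("2015-09-09 20:01:46"));       // true
        System.out.println(acceptedByValueOf("2015-09-09 20:01:46+00:00")); // false
        System.out.println(acceptedByValueOf("2015-09-09T20:01:46Z"));      // false
    }
}
```

Any consumer that relies on valueOf today would need a different parser if an offset were appended to the stored string.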
We should investigate what we should do about this issue. Right now, I can think of three options:
1. Encode timezone info in the timestamp value. This can break user code and may change the semantics of our timestamp type (our timestamp value is timezone-less).
2. When saving a timestamp value to JSON, treat it as a value in the local timezone and convert it to UTC time. The saved value itself still carries no timezone info.
3. Keep the current behavior, but explicitly say in our doc that users need to use a single timezone for their datasets (e.g. always use UTC time).
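As a rough illustration of how options 1 and 2 would differ in the stored string, here is a sketch using java.time. It is not Spark code; the class {{TimestampJsonOptions}}, the method names, the use of an ISO-8601 offset for option 1, and the Asia/Tokyo producer zone are all assumptions for the sake of the example.

```java
import java.time.Instant;
import java.time.LocalDateTime;
import java.time.ZoneId;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

class TimestampJsonOptions {
    private static final DateTimeFormatter PLAIN =
        DateTimeFormatter.ofPattern("uuuu-MM-dd HH:mm:ss");

    // Option 1 (one possible encoding): keep the producer's offset in the string.
    static String withOffset(Instant instant, ZoneId producerZone) {
        return DateTimeFormatter.ISO_OFFSET_DATE_TIME.format(instant.atZone(producerZone));
    }

    // Option 2: treat the value as local time, shift it to UTC,
    // and write it in the existing zone-less format.
    static String normalizedToUtc(LocalDateTime local, ZoneId producerZone) {
        return local.atZone(producerZone)
                    .withZoneSameInstant(ZoneOffset.UTC)
                    .format(PLAIN);
    }

    public static void main(String[] args) {
        ZoneId tokyo = ZoneId.of("Asia/Tokyo");
        Instant instant = Instant.parse("2015-09-09T20:01:46Z");
        System.out.println(withOffset(instant, tokyo));
        // 2015-09-10T05:01:46+09:00
        System.out.println(normalizedToUtc(LocalDateTime.of(2015, 9, 10, 5, 1, 46), tokyo));
        // 2015-09-09 20:01:46
    }
}
```

Option 2 keeps the string shape that existing consumers already parse, at the cost of consumers having to know that the value is in UTC. Option 1 is self-describing but changes the string shape.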