Posted to issues@spark.apache.org by "Yin Huai (JIRA)" <ji...@apache.org> on 2015/09/09 20:01:46 UTC

[jira] [Created] (SPARK-10519) Investigate if we should encode timezone information to a timestamp value stored in JSON

Yin Huai created SPARK-10519:
--------------------------------

             Summary: Investigate if we should encode timezone information to a timestamp value stored in JSON
                 Key: SPARK-10519
                 URL: https://issues.apache.org/jira/browse/SPARK-10519
             Project: Spark
          Issue Type: Task
          Components: SQL
            Reporter: Yin Huai
            Priority: Minor


Since Spark 1.3, we have stored timestamps in JSON without encoding timezone information, so the string representation of a timestamp stored in JSON implicitly uses the local timezone (see [1|https://github.com/apache/spark/blob/branch-1.3/sql/core/src/main/scala/org/apache/spark/sql/json/JsonRDD.scala#L454], [2|https://github.com/apache/spark/blob/branch-1.4/sql/core/src/main/scala/org/apache/spark/sql/json/JacksonGenerator.scala#L38], [3|https://github.com/apache/spark/blob/branch-1.5/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/json/JacksonGenerator.scala#L41], [4|https://github.com/apache/spark/blob/branch-1.5/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/json/JacksonGenerator.scala#L93]). As a result, data consumers may get different values when they are in a different timezone from the data producers.
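
To make the ambiguity concrete, here is a minimal sketch in plain Java (not Spark code). It assumes the stored string has the form {{yyyy-MM-dd HH:mm:ss}}; the class name, the helper {{interpret}}, and the two zone IDs are hypothetical and chosen only for illustration. It shows how the same timezone-less string maps to different instants depending on the reader's timezone.

```java
import java.time.Duration;
import java.time.Instant;
import java.time.LocalDateTime;
import java.time.ZoneId;
import java.time.format.DateTimeFormatter;

// Illustrative only: not part of Spark.
class TimestampZoneAmbiguity {
    private static final DateTimeFormatter FORMAT =
        DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss");

    // Interpret a timezone-less string as wall-clock time in the given zone
    // and return the absolute instant it denotes there.
    static Instant interpret(String text, String zoneId) {
        return LocalDateTime.parse(text, FORMAT)
            .atZone(ZoneId.of(zoneId))
            .toInstant();
    }

    public static void main(String[] args) {
        String stored = "2015-09-09 20:01:46"; // value as written to JSON

        // Hypothetical producer and consumer timezones.
        Instant producer = interpret(stored, "America/Los_Angeles");
        Instant consumer = interpret(stored, "Asia/Tokyo");

        System.out.println(producer); // 2015-09-10T03:01:46Z
        System.out.println(consumer); // 2015-09-09T11:01:46Z
        // The two readings of the same string are 16 hours apart.
        System.out.println(Duration.between(consumer, producer).toHours()); // 16
    }
}
```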

Since JSON is string based, if we encode timezone information in the timestamp value, downstream applications may need to change their code (for example, java.sql.Timestamp.valueOf only supports the format {{yyyy-\[m]m-\[d]d hh:mm:ss\[.f...]}}).
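
As a rough check of the compatibility concern, the sketch below tries a few candidate strings against java.sql.Timestamp.valueOf. The class name and the helper {{parsesWithValueOf}} are hypothetical, and the timezone-bearing formats are only examples of what an encoded value might look like, not a proposed format.

```java
import java.sql.Timestamp;

// Illustrative only: checks which string forms Timestamp.valueOf accepts.
class ValueOfFormatCheck {
    // Returns true if Timestamp.valueOf can parse the text, false if it
    // rejects it with an IllegalArgumentException (NumberFormatException
    // is a subclass, so it is covered as well).
    static boolean parsesWithValueOf(String text) {
        try {
            Timestamp.valueOf(text);
            return true;
        } catch (IllegalArgumentException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Current timezone-less forms are accepted.
        System.out.println(parsesWithValueOf("2015-09-09 20:01:46"));       // true
        System.out.println(parsesWithValueOf("2015-09-09 20:01:46.123"));   // true

        // Example forms that carry timezone information are rejected.
        System.out.println(parsesWithValueOf("2015-09-09T20:01:46Z"));      // false
        System.out.println(parsesWithValueOf("2015-09-09 20:01:46+02:00")); // false
    }
}
```

A consumer that relies on valueOf today would therefore need a different parser if the stored strings gained a zone suffix.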

We should investigate what we should do about this issue. Right now, I can think of three options:

1. Encode timezone info in the timestamp value, which can break user code and may change the semantics of our timestamps (our timestamp value is timezone-less).
2. When saving a timestamp value to JSON, treat it as a value in the local timezone and convert it to UTC. Then, when saving the data, do not encode timezone info in the value.
3. Do not change our current behavior, but explicitly say in our docs that users need to use a single timezone for their datasets (e.g. always use UTC).
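
For discussion, option 2 could look roughly like the following sketch. This is plain Java rather than the actual JacksonGenerator code; the class name, the helper {{toUtcString}}, and the producer zone are assumptions made for illustration. The value is treated as wall-clock time in the producer's timezone, shifted to UTC, and written in the same timezone-less format.

```java
import java.time.LocalDateTime;
import java.time.ZoneId;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

// Illustrative only: a sketch of option 2, not Spark's implementation.
class UtcNormalizingWriter {
    private static final DateTimeFormatter FORMAT =
        DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss");

    // Treat localText as wall-clock time in producerZone, convert it to UTC,
    // and render it without any timezone suffix.
    static String toUtcString(String localText, String producerZone) {
        LocalDateTime local = LocalDateTime.parse(localText, FORMAT);
        LocalDateTime utc = local.atZone(ZoneId.of(producerZone))
            .withZoneSameInstant(ZoneOffset.UTC)
            .toLocalDateTime();
        return utc.format(FORMAT);
    }

    public static void main(String[] args) {
        // Hypothetical producer running in America/Los_Angeles (UTC-7 on this date).
        System.out.println(toUtcString("2015-09-09 20:01:46", "America/Los_Angeles"));
        // 2015-09-10 03:01:46
    }
}
```

The output keeps the existing string format, so parsers such as java.sql.Timestamp.valueOf would still accept it, but consumers would have to know that the value is in UTC rather than in their local timezone.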




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
