Posted to issues@spark.apache.org by "Anupam Jain (Jira)" <ji...@apache.org> on 2020/06/17 18:30:00 UTC

[jira] [Created] (SPARK-32016) Why spark does not preserve the original timestamp format while writing dataset to file or hdfs

Anupam Jain created SPARK-32016:
-----------------------------------

             Summary: Why spark does not preserve the original timestamp format while writing dataset to file or hdfs
                 Key: SPARK-32016
                 URL: https://issues.apache.org/jira/browse/SPARK-32016
             Project: Spark
          Issue Type: Bug
          Components: SQL, Structured Streaming
    Affects Versions: 2.4.3, 2.4.0, 2.3.0
         Environment: Apache Spark 2.3 and Spark 2.4; may happen in other versions as well
            Reporter: Anupam Jain
             Fix For: 3.0.0


I want to write a Spark dataset with a few timestamp columns to HDFS.
 * While reading, Spark by default infers a column as timestamp if its format is similar to "*yyyy-MM-dd HH:mm:ss*".
 * But while writing to file, it saves the values in the format "*yyyy-MM-dd'T'HH:mm:ss.SSSXXX*".
 * For example, the source value *2020-06-01 12:10:03* is written as *2020-06-01T12:10:03.000+05:30*.
 * The expectation is to preserve the original timestamp format when writing.

Why does Spark not preserve the original timestamp format when writing a dataset to file or HDFS?

Using simple Java code like:

Dataset<Row> ds = spark.read().format("csv").option("path", the_path).option("inferSchema", "true").load();
ds.write().format("csv").save("path_to_save");
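
For reference, a minimal self-contained sketch of the reproduction; the class name, SparkSession setup, and input/output paths are made up for illustration and are not part of the original report:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class TimestampFormatRepro {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder().appName("timestamp-format-repro").getOrCreate();

    // Input CSV contains values like: 2020-06-01 12:10:03
    // With inferSchema enabled, the column is inferred as TimestampType.
    Dataset<Row> ds = spark.read()
        .format("csv")
        .option("inferSchema", "true")
        .load("/tmp/input_csv");   // hypothetical input path

    // On write, the value comes out as 2020-06-01T12:10:03.000+05:30
    // (the writer's default timestampFormat, rendered in the session time zone)
    // instead of the original text.
    ds.write().format("csv").save("/tmp/output_csv");   // hypothetical output path

    spark.stop();
  }
}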

I know the workarounds (see the sketch after this list):
 * Use the "*timestampFormat*" option before save.
 * But it may have a performance overhead, and it is global for all columns.
 * For example, say there are 2 columns with formats "*yyyy-MM-dd HH:mm:ss*" and "*yyyy-MM-dd HH*". Both can be inferred as timestamp by default, but both are written out in the single specified "timestampFormat".
 * Another way is to use date_format(col, format). But that may also have a performance overhead and adds an extra transformation, whereas I expect Spark to preserve the original format.
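
A hedged sketch of the two workarounds mentioned above, continuing from the "ds" dataset in the snippet earlier; the column names and output paths are hypothetical:

import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.date_format;

// Workaround 1: set a single output format for all timestamp columns
// via the writer's timestampFormat option (global, as noted above).
ds.write()
  .format("csv")
  .option("timestampFormat", "yyyy-MM-dd HH:mm:ss")
  .save("/tmp/output_with_format");   // hypothetical path

// Workaround 2: convert selected columns back to strings with date_format,
// one pattern per column, at the cost of an extra projection.
Dataset<Row> formatted = ds
    .withColumn("ts_full", date_format(col("ts_full"), "yyyy-MM-dd HH:mm:ss"))   // hypothetical column
    .withColumn("ts_hour", date_format(col("ts_hour"), "yyyy-MM-dd HH"));        // hypothetical column
formatted.write().format("csv").save("/tmp/output_per_column");                  // hypothetical path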

Tried with Spark 2.3 and Spark 2.4.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org