Posted to issues@spark.apache.org by "Anupam Jain (Jira)" <ji...@apache.org> on 2020/06/17 18:30:00 UTC
[jira] [Created] (SPARK-32016) Why spark does not preserve the
original timestamp format while writing dataset to file or hdfs
Anupam Jain created SPARK-32016:
-----------------------------------
Summary: Why spark does not preserve the original timestamp format while writing dataset to file or hdfs
Key: SPARK-32016
URL: https://issues.apache.org/jira/browse/SPARK-32016
Project: Spark
Issue Type: Bug
Components: SQL, Structured Streaming
Affects Versions: 2.4.3, 2.4.0, 2.3.0
Environment: Apache Spark 2.3 and Spark 2.4. May happen in other versions as well
Reporter: Anupam Jain
Fix For: 3.0.0
I want to write a Spark dataset that has a few timestamp columns to HDFS.
* While reading, Spark by default infers a column as timestamp if its format is similar to "*yyyy-MM-dd HH:mm:ss*".
* But while writing to a file, it saves the values in the format "*yyyy-MM-dd'T'HH:mm:ss.SSSXXX*".
* For example, the source value *2020-06-01 12:10:03* is written as *2020-06-01T12:10:03.000+05:30*.
* The expected behaviour is to preserve the original timestamp format when writing.
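The change of format described in the list above can be reproduced outside Spark with plain java.time. This is only a sketch of the effect, not Spark's internal code path: the class name TimestampRoundTrip and the method rewrite are made up for illustration, and the +05:30 offset is an assumption taken from the example value in this report.

```java
import java.time.LocalDateTime;
import java.time.OffsetDateTime;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

// Hypothetical helper, not part of Spark: parses with the source pattern
// and prints with the pattern seen in the written file.
final class TimestampRoundTrip {
    // Pattern of the source data, as quoted in the report.
    static final DateTimeFormatter SOURCE =
            DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss");
    // Pattern observed in the output file, as quoted in the report.
    static final DateTimeFormatter WRITTEN =
            DateTimeFormatter.ofPattern("yyyy-MM-dd'T'HH:mm:ss.SSSXXX");

    static String rewrite(String text, ZoneOffset offset) {
        // After parsing, only the date-time value remains; the text layout is gone.
        LocalDateTime parsed = LocalDateTime.parse(text, SOURCE);
        // Attach an offset (assumed here) so that the XXX field can be printed.
        OffsetDateTime value = parsed.atOffset(offset);
        return value.format(WRITTEN);
    }

    public static void main(String[] args) {
        ZoneOffset offset = ZoneOffset.ofHoursMinutes(5, 30); // assumed session offset
        System.out.println(rewrite("2020-06-01 12:10:03", offset));
    }
}
```

Once the text has been parsed, the value carries no record of how it was originally written, so whatever pattern the writer uses decides the output.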
Why does Spark not preserve the original timestamp format while writing a dataset to a file or HDFS?
Using simple Java code like:
Dataset<Row> ds = spark.read().format("csv").option("path",the_path).option("inferSchema","true").load();
ds.write().format("csv").save("path_to_save");
I know the workarounds:
* Use the "*timestampFormat*" option before save.
* But it may have performance overhead, and it is global for all columns.
* So let's say there are 2 columns with formats "*yyyy-MM-dd HH:mm:ss*" and "*yyyy-MM-dd HH*". Both can be inferred as timestamp by default, but both are output in the single specified "timestampFormat".
* Another way is to use date_format(col, format). But that may also have performance overhead and requires extra operations to apply, whereas I expect Spark to preserve the original format.
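The difference between one global pattern and one pattern per column can be sketched without Spark, again using plain java.time. The class PerColumnTimestampFormat, the method render, and the column names event_time and event_hour are all hypothetical and chosen only for this illustration. The two patterns are the ones from the list above.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of Spark: renders each timestamp cell
// with the pattern registered for its column.
final class PerColumnTimestampFormat {
    static Map<String, String> render(Map<String, LocalDateTime> row,
                                      Map<String, String> patterns) {
        Map<String, String> out = new LinkedHashMap<>();
        for (Map.Entry<String, LocalDateTime> cell : row.entrySet()) {
            String pattern = patterns.get(cell.getKey());
            if (pattern == null) {
                throw new IllegalArgumentException("no pattern for column " + cell.getKey());
            }
            out.put(cell.getKey(), cell.getValue().format(DateTimeFormatter.ofPattern(pattern)));
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, LocalDateTime> row = new LinkedHashMap<>();
        row.put("event_time", LocalDateTime.of(2020, 6, 1, 12, 10, 3)); // hypothetical column
        row.put("event_hour", LocalDateTime.of(2020, 6, 1, 12, 0, 0));  // hypothetical column

        Map<String, String> patterns = new LinkedHashMap<>();
        patterns.put("event_time", "yyyy-MM-dd HH:mm:ss");
        patterns.put("event_hour", "yyyy-MM-dd HH");

        System.out.println(render(row, patterns));
    }
}
```

In Spark itself, the per-column step would correspond to the date_format(col, format) call mentioned above, applied once per timestamp column before the write. The pattern for each column still has to be supplied by the caller.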
Tried with Spark 2.3 and Spark 2.4.