Posted to issues@spark.apache.org by "Barry Becker (JIRA)" <ji...@apache.org> on 2016/08/15 21:30:20 UTC

[jira] [Updated] (SPARK-17066) dateFormat should be used when writing dataframes as csv files

     [ https://issues.apache.org/jira/browse/SPARK-17066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Barry Becker updated SPARK-17066:
---------------------------------
    Description: 
I noticed this when running tests after pulling and building @lw-lin's PR (https://github.com/apache/spark/pull/14118). I don't think anything is wrong with that PR; rather, the fix made to spark-csv for this issue was never ported to Spark 2.x when Databricks' spark-csv was merged into Spark 2 back in January. https://github.com/databricks/spark-csv/issues/308 was fixed in spark-csv after that merge.

The problem is that if I try to write a dataframe that contains a date column out to a CSV file using something like this:

repartitionDf.write.format("csv") //.format(DATABRICKS_CSV)
  .option("delimiter", "\t")
  .option("header", "false")
  .option("nullValue", "?")
  .option("dateFormat", "yyyy-MM-dd'T'HH:mm:ss")
  .option("escape", "\\")
  .save(tempFileName)
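
For a self-contained reproduction, something along these lines should show the same behavior (a sketch; the SparkSession named `spark`, the sample data, and the output path are illustrative stand-ins, not the original test's code):

import java.sql.Timestamp
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructField, StructType, TimestampType}

// Three rows in a single timestamp column -- two values and a null --
// mirroring the three values described below.
val schema = StructType(Seq(StructField("date", TimestampType, nullable = true)))
val rows = Seq(
  Row(Timestamp.valueOf("2012-01-03 09:12:00")),
  Row(null),
  Row(Timestamp.valueOf("2015-02-23 18:00:00"))
)
val df = spark.createDataFrame(spark.sparkContext.parallelize(rows), schema)

df.write.format("csv")
  .option("delimiter", "\t")
  .option("header", "false")
  .option("nullValue", "?")
  .option("dateFormat", "yyyy-MM-dd'T'HH:mm:ss")
  .option("escape", "\\")
  .save("/tmp/spark-17066-repro") // illustrative output path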

Then my unit test (which passed under Spark 1.6.2) fails with the Spark 2.1.0 snapshot build that I made today. The dataframe contained three values in a date column: two timestamps and a null.

Expected "[2012-01-03T09:12:00
?
2015-02-23T18:00:]00", 
but got 
"[1325610720000000
?
14247432000000]00"

This means that while the null value is being exported correctly, the specified dateFormat is not being used to format the dates. Instead, the raw internal value appears to be written out: 1325610720000000 looks like microseconds since the epoch (1325610720 seconds, i.e. 2012-01-03), not a formatted date.
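
A quick sanity check is consistent with that reading (a sketch; the UTC-8 zone below is an assumption inferred from the expected local-time strings, not something stated in the test):

import java.time.{Instant, ZoneId}

// Treat the raw CSV value as microseconds since the epoch, which
// matches Catalyst's internal representation for TimestampType.
val rawMicros = 1325610720000000L
val instant = Instant.ofEpochSecond(rawMicros / 1000000L)
println(instant)
// 2012-01-03T17:12:00Z

// Rendered in a UTC-8 zone this matches the expected "2012-01-03T09:12:00"
// (LocalDateTime.toString drops the :00 seconds):
println(instant.atZone(ZoneId.of("America/Los_Angeles")).toLocalDateTime)
// 2012-01-03T09:12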

> dateFormat should be used when writing dataframes as csv files
> --------------------------------------------------------------
>
>                 Key: SPARK-17066
>                 URL: https://issues.apache.org/jira/browse/SPARK-17066
>             Project: Spark
>          Issue Type: Bug
>          Components: Input/Output
>    Affects Versions: 2.0.0
>            Reporter: Barry Becker