Posted to issues@spark.apache.org by "Andy Grove (Jira)" <ji...@apache.org> on 2023/10/05 17:26:00 UTC

[jira] [Created] (SPARK-45424) Regression in CSV schema inference when timestampFormat is specified

Andy Grove created SPARK-45424:
----------------------------------

             Summary: Regression in CSV schema inference when timestampFormat is specified
                 Key: SPARK-45424
                 URL: https://issues.apache.org/jira/browse/SPARK-45424
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 3.5.0
            Reporter: Andy Grove


Spark 3.5.0 has a regression in CSV schema inference for files containing timestamps: a column is inferred as timestamp even when its contents do not match the specified timestampFormat.

*Test Data*

I have the following CSV file, saved as /tmp/timestamps.csv:
{code:java}
2884-06-24T02:45:51.138
2884-06-24T02:45:51.138
2884-06-24T02:45:51.138
{code}
*Spark 3.4.0 Behavior (correct)*

In Spark 3.4.0, if I specify the correct timestamp format, then the schema is inferred as timestamp:
{code:java}
scala> val df = spark.read.option("timestampFormat", "yyyy-MM-dd'T'HH:mm:ss.SSS").option("inferSchema", true).csv("/tmp/timestamps.csv")
df: org.apache.spark.sql.DataFrame = [_c0: timestamp]
{code}
If I specify an incompatible timestampFormat, then the schema is inferred as string:
{code:java}
scala> val df = spark.read.option("timestampFormat", "yyyy-MM-dd'T'HH:mm:ss").option("inferSchema", true).csv("/tmp/timestamps.csv")
df: org.apache.spark.sql.DataFrame = [_c0: string]
{code}
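The 3.4.0 behavior above amounts to a simple rule: treat the column as timestamp only when every value parses in full with the specified format, otherwise fall back to string. Below is a minimal sketch of that rule using plain java.time, written in REPL style so it can be pasted into spark-shell. The helper name inferColumnType is hypothetical, and this is not Spark's inference code, which also considers other types and uses its own formatter.

```scala
import java.time.LocalDateTime
import java.time.format.DateTimeFormatter
import scala.util.Try

// Hypothetical helper, NOT Spark's implementation: returns "timestamp" only if
// every value parses completely with the given pattern, otherwise "string".
def inferColumnType(values: Seq[String], pattern: String): String = {
  val formatter = DateTimeFormatter.ofPattern(pattern)
  val allParse = values.nonEmpty &&
    values.forall(v => Try(LocalDateTime.parse(v, formatter)).isSuccess)
  if (allParse) "timestamp" else "string"
}

// Same three rows as the test data above.
val rows = Seq.fill(3)("2884-06-24T02:45:51.138")

println(inferColumnType(rows, "yyyy-MM-dd'T'HH:mm:ss.SSS")) // timestamp
println(inferColumnType(rows, "yyyy-MM-dd'T'HH:mm:ss"))     // string
```

The key point is that a value counts as a match only if the whole string is consumed by the pattern; a prefix match is not enough.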
*Spark 3.5.0 Behavior (regression)*

In Spark 3.5.0, the same column is inferred as timestamp even though the data does not match the specified timestampFormat:
{code:java}
scala> val df = spark.read.option("timestampFormat", "yyyy-MM-dd'T'HH:mm:ss").option("inferSchema", true).csv("/tmp/timestamps.csv")
df: org.apache.spark.sql.DataFrame = [_c0: timestamp]
{code}
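Reading the DataFrame then fails with a parse error, because the pattern consumes only the first 19 characters of each value and leaves the fractional seconds (.138) unparsed: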
{code:java}
Caused by: java.time.format.DateTimeParseException: Text '2884-06-24T02:45:51.138' could not be parsed, unparsed text found at index 19
{code}
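The index in that message can be checked outside Spark. The sketch below uses plain java.time, which is an assumption for illustration since Spark builds its own formatter internally, to parse one value from the test file with the shorter pattern and report where parsing stopped.

```scala
import java.time.LocalDateTime
import java.time.format.{DateTimeFormatter, DateTimeParseException}

val value = "2884-06-24T02:45:51.138"
val formatter = DateTimeFormatter.ofPattern("yyyy-MM-dd'T'HH:mm:ss")

// The pattern matches "2884-06-24T02:45:51" (19 characters) and then stops,
// so the strict parse fails with the remaining text reported as unparsed.
val errorIndex = try {
  LocalDateTime.parse(value, formatter)
  -1 // not reached for this input
} catch {
  case e: DateTimeParseException => e.getErrorIndex
}

println(errorIndex)                  // 19
println(value.substring(errorIndex)) // .138
```

So the failure at read time is the strict full-string parse rejecting the trailing milliseconds, which is the same mismatch that schema inference in 3.4.0 treated as "not a timestamp".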


