Posted to issues@spark.apache.org by "Nacho García Fernández (JIRA)" <ji...@apache.org> on 2018/01/23 13:29:00 UTC

[jira] [Created] (SPARK-23190) Error when inferring date columns

Nacho García Fernández created SPARK-23190:
----------------------------------------------

             Summary: Error when inferring date columns
                 Key: SPARK-23190
                 URL: https://issues.apache.org/jira/browse/SPARK-23190
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 2.2.1, 2.1.2, 2.1.1
            Reporter: Nacho García Fernández


Hi.

I'm trying to read the following file using the spark.sql read utility:

```

c1;c2;c3;c4;c5
"+0000000.";"2";"x";"20001122";2000
"-0000010.21";"2";"x";"19991222";2000
"+0000113.34";"00";"v";"20001022";2000
"+0000000.";"0";"a";"20120322";2000

```

I'm doing this in the spark-shell using the following command:

```

spark.sqlContext.read
  .option("inferSchema", "true")
  .option("header", "true")
  .option("delimiter", ";")
  .option("timestampFormat", "yyyyMMdd")
  .csv("myfile.csv")
  .printSchema

```

and I'm getting the following schema:

```

root
 |-- c1: double (nullable = true)
 |-- c2: integer (nullable = true)
 |-- c3: string (nullable = true)
 |-- c4: integer (nullable = true)
 |-- c5: integer (nullable = true)

```

As you can see, column c4 is inferred as Integer instead of Timestamp. I think this is due to the order of the cases in the following match clause:

[https://github.com/apache/spark/blob/1c9f95cb771ac78775a77edd1abfeb2d8ae2a124/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVInferSchema.scala#L87]
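To illustrate the issue, here is a minimal sketch (not the actual Spark code, and simplified to three branches) of how a per-field inference cascade that tries integers before timestamps behaves; `InferSketch` and its `inferField` helper are hypothetical names, and the `SimpleDateFormat` pattern mirrors the `timestampFormat` option used above:

```scala
import java.text.SimpleDateFormat
import scala.util.Try

// Hypothetical sketch of a type-inference cascade in which integer
// parsing is attempted before timestamp parsing, as in the linked
// match clause. A digits-only date such as "20001122" parses as an
// integer and therefore never reaches the timestamp branch.
object InferSketch {
  // Mirrors the reporter's timestampFormat option.
  private val fmt = new SimpleDateFormat("yyyyMMdd")

  def inferField(field: String): String =
    if (Try(field.toInt).isSuccess) "integer"             // tried first
    else if (Try(field.toDouble).isSuccess) "double"
    else if (Try(fmt.parse(field)).isSuccess) "timestamp" // tried later
    else "string"
}
```

With this ordering, `InferSketch.inferField("20001122")` yields `"integer"` and `InferSketch.inferField("+0000000.")` yields `"double"`, which matches the schema printed above.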

 

Since my date column consists only of digits, it is inferred as Integer. Would it be correct to change the order in the match clause and give preference to Timestamps? That might be bad in terms of performance, since every integer value would first be tried as a timestamp, but the current implementation also cannot handle date formats that consist entirely of digits...
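For comparison, a hypothetical sketch of the reordering suggested here: trying the timestamp pattern before integers would classify "20001122" as a timestamp, at the cost of attempting a (failing) date parse for every ordinary integer field. `ReorderedSketch` is an illustrative name, and the length-8 guard is an assumed heuristic to avoid `SimpleDateFormat` accepting digit strings with trailing characters:

```scala
import java.text.SimpleDateFormat
import scala.util.Try

// Hypothetical reordered cascade: timestamp is tried before integer.
object ReorderedSketch {
  private val fmt = {
    val f = new SimpleDateFormat("yyyyMMdd")
    f.setLenient(false) // reject calendar-invalid values like "20001345"
    f
  }

  def inferField(field: String): String =
    // Guard on length because SimpleDateFormat.parse ignores trailing
    // text, so e.g. "200011223" would otherwise parse as a date.
    if (field.length == 8 && Try(fmt.parse(field)).isSuccess) "timestamp"
    else if (Try(field.toInt).isSuccess) "integer"
    else "string"
}
```

Here `ReorderedSketch.inferField("20001122")` yields `"timestamp"` while short values like `"2000"` still come out as `"integer"`, so c4 would become a timestamp without affecting c5.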
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org