Posted to issues@spark.apache.org by "Nacho García Fernández (JIRA)" <ji...@apache.org> on 2018/01/23 13:29:00 UTC
[jira] [Created] (SPARK-23190) Error when inferring date columns
Nacho García Fernández created SPARK-23190:
----------------------------------------------
Summary: Error when inferring date columns
Key: SPARK-23190
URL: https://issues.apache.org/jira/browse/SPARK-23190
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 2.2.1, 2.1.2, 2.1.1
Reporter: Nacho García Fernández
Hi.
I'm trying to read the following file using the spark.sql read utility:
```
c1;c2;c3;c4;c5
"+0000000.";"2";"x";"20001122";2000
"-0000010.21";"2";"x";"19991222";2000
"+0000113.34";"00";"v";"20001022";2000
"+0000000.";"0";"a";"20120322";2000
```
I'm doing this in the spark-shell using the following command:
```
spark.sqlContext.read.option("inferSchema", "true").option("header", "true").option("delimiter", ";").option("timestampFormat","yyyyMMdd").csv("myfile.csv").printSchema
```
and I'm getting the following schema:
```
root
|-- c1: double (nullable = true)
|-- c2: integer (nullable = true)
|-- c3: string (nullable = true)
|-- c4: integer (nullable = true)
|-- c5: integer (nullable = true)
```
As you can see, column c4 is being inferred as Integer instead of Timestamp. I think this is due to the order used in the following match clause:
[https://github.com/apache/spark/blob/1c9f95cb771ac78775a77edd1abfeb2d8ae2a124/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVInferSchema.scala#L87]
Since my date consists only of digits, it is being inferred as Integer. Would it be correct to change the order in the match clause and give preference to Timestamps? I think this would hurt performance, since every integer value would first be tried as a timestamp, but I also think that the current implementation is not valid for dates which are made up entirely of digits.
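To make the ordering question concrete, here is a minimal standalone sketch. It is not the actual CSVInferSchema code: the names `InferOrderSketch`, `inferIntFirst` and `inferTimestampFirst` are hypothetical, only two candidate types are modelled, and `java.time` is used as a stand-in for whatever date parser Spark uses internally.

```scala
import java.time.LocalDate
import java.time.format.DateTimeFormatter
import scala.util.Try

// Hypothetical illustration only; this does not reproduce Spark's inference code.
object InferOrderSketch {
  // Simplified stand-ins for the inferred column types.
  sealed trait Inferred
  case object IntegerT extends Inferred
  case object TimestampT extends Inferred
  case object StringT extends Inferred

  // True when the whole field parses as a 32-bit integer.
  private def isInt(s: String): Boolean = Try(s.toInt).isSuccess

  // True when the whole field matches the given date pattern.
  // Assumption: java.time stands in for Spark's own timestamp parsing.
  private def isTimestamp(s: String, pattern: String): Boolean =
    Try(LocalDate.parse(s, DateTimeFormatter.ofPattern(pattern))).isSuccess

  // Order described in this report: integer is tried before timestamp.
  def inferIntFirst(field: String, pattern: String): Inferred =
    if (isInt(field)) IntegerT
    else if (isTimestamp(field, pattern)) TimestampT
    else StringT

  // Proposed order: timestamp is tried before integer.
  def inferTimestampFirst(field: String, pattern: String): Inferred =
    if (isTimestamp(field, pattern)) TimestampT
    else if (isInt(field)) IntegerT
    else StringT

  def main(args: Array[String]): Unit = {
    val pattern = "yyyyMMdd"
    // One value from c4, one from c5, one from c3 of the sample file.
    for (field <- Seq("20001122", "2000", "x")) {
      println(s"$field: intFirst=${inferIntFirst(field, pattern)}, " +
        s"timestampFirst=${inferTimestampFirst(field, pattern)}")
    }
  }
}
```

In this sketch, a c4 value such as `20001122` only comes out as a timestamp when the timestamp check runs first, while a c5 value such as `2000` still falls through to integer because it does not match `yyyyMMdd`. The extra parse attempt on every integer field is the performance cost mentioned above.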