Posted to issues@spark.apache.org by "Sean Owen (JIRA)" <ji...@apache.org> on 2018/01/23 13:31:00 UTC

[jira] [Commented] (SPARK-23190) Error when infering date columns

    [ https://issues.apache.org/jira/browse/SPARK-23190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16335775#comment-16335775 ] 

Sean Owen commented on SPARK-23190:
-----------------------------------

No, it would cause more serious problems. Suddenly integers that happened to look like dates would become dates. 
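To illustrate the point: below is a minimal, self-contained sketch (not Spark's actual code) of CSVInferSchema's try-in-order inference, in which integer parsing is attempted before timestamp parsing. Under that order, an all-digit field like "20001122" succeeds as an integer and never reaches the timestamp branch; flipping the order would turn such integers into dates. The `inferField` helper and its return labels are hypothetical names for illustration.

```scala
import java.text.SimpleDateFormat
import scala.util.Try

// Hypothetical sketch of try-in-order type inference, integer before timestamp,
// mirroring the order the issue points at in CSVInferSchema.
def inferField(field: String): String = {
  val timestampFormat = new SimpleDateFormat("yyyyMMdd")
  timestampFormat.setLenient(false)
  if (Try(field.toInt).isSuccess) "integer"            // tried first, wins for "20001122"
  else if (Try(field.toDouble).isSuccess) "double"     // wins for "+0000000."
  else if (Try(timestampFormat.parse(field)).isSuccess) "timestamp"
  else "string"
}

println(inferField("20001122"))   // all-digit date: inferred as integer
println(inferField("+0000000."))  // inferred as double
println(inferField("x"))          // inferred as string
```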

You can impose your own schema in cases like this.
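For this file, imposing a schema could look like the following sketch, assuming Spark 2.2 in spark-shell with an active `spark` session (the column types other than c4 match what inference produced):

```scala
import org.apache.spark.sql.types._

// Declare c4 as a timestamp explicitly instead of relying on inference.
val schema = StructType(Seq(
  StructField("c1", DoubleType,    nullable = true),
  StructField("c2", IntegerType,   nullable = true),
  StructField("c3", StringType,    nullable = true),
  StructField("c4", TimestampType, nullable = true),
  StructField("c5", IntegerType,   nullable = true)))

val df = spark.read
  .schema(schema)                         // replaces option("inferSchema", "true")
  .option("header", "true")
  .option("delimiter", ";")
  .option("timestampFormat", "yyyyMMdd")  // used to parse the c4 values
  .csv("myfile.csv")

df.printSchema()
```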

> Error when infering date columns
> --------------------------------
>
>                 Key: SPARK-23190
>                 URL: https://issues.apache.org/jira/browse/SPARK-23190
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.1.1, 2.1.2, 2.2.1
>            Reporter: Nacho García Fernández
>            Priority: Major
>
> Hi.
> I'm trying to read the following file using the spark.sql read utility:
> ```
> c1;c2;c3;c4;c5
> "+0000000.";"2";"x";"20001122";2000
> "-0000010.21";"2";"x";"19991222";2000
> "+0000113.34";"00";"v";"20001022";2000
> "+0000000.";"0";"a";"20120322";2000
> ```
> I'm doing this in the spark-shell using the following command:
> ```
>  spark.sqlContext.read.option("inferSchema", "true").option("header", "true").option("delimiter", ";").option("timestampFormat","yyyyMMdd").csv("myfile.csv").printSchema
> ```
> and I'm getting the following schema:
> ```
> root
>  |-- c1: double (nullable = true)
>  |-- c2: integer (nullable = true)
>  |-- c3: string (nullable = true)
>  |-- c4: integer (nullable = true)
>  |-- c5: integer (nullable = true)
> ```
> As you can see, column c4 is being inferred as Integer instead of Timestamp. I think this is due to the order used in the following match clause: 
> [https://github.com/apache/spark/blob/1c9f95cb771ac78775a77edd1abfeb2d8ae2a124/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVInferSchema.scala#L87]
>  
> Since my date column consists only of digits, it is being inferred as Integer. Would it be correct to change the order in the match clause and give preference to Timestamps? I suspect this would hurt performance, since every integer value would first be tried as a timestamp, but the current implementation is also not valid for dates that consist entirely of digits...



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org