Posted to issues@spark.apache.org by "Nacho García Fernández (JIRA)" <ji...@apache.org> on 2018/01/23 13:32:00 UTC

[jira] [Updated] (SPARK-23190) Error when inferring date columns

     [ https://issues.apache.org/jira/browse/SPARK-23190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Nacho García Fernández updated SPARK-23190:
-------------------------------------------
    Description: 
Hi.

I'm trying to read the following file using the spark.sql read utility:

 

 
{code:java}
c1;c2;c3;c4;c5 
"+0000000.";"2";"x";"20001122";2000
"-0000010.21";"2";"x";"19991222";2000 
"+0000113.34";"00";"v";"20001022";2000 
"+0000000.";"0";"a";"20120322";2000
{code}
 

I'm doing this in the spark-shell using the following command: 

 
{code:java}
spark.sqlContext.read.option("inferSchema", "true").option("header", "true").option("delimiter", ";").option("timestampFormat","yyyyMMdd").csv("myfile.csv").printSchema
{code}
and I'm getting the following schema:

 
{code:java}
root 
 |-- c1: double (nullable = true)
 |-- c2: integer (nullable = true)
 |-- c3: string (nullable = true)
 |-- c4: integer (nullable = true)
 |-- c5: integer (nullable = true)
{code}
 

As you can see, the column c4 is being inferred as Integer instead of Timestamp. I think this is due to the order used in the following match clause: 

[https://github.com/apache/spark/blob/1c9f95cb771ac78775a77edd1abfeb2d8ae2a124/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVInferSchema.scala#L87]
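
For illustration, here is a minimal standalone sketch (my own simplification, not the actual Spark code): a digits-only date such as "20001122" parses successfully both as an Int and as a timestamp with the pattern "yyyyMMdd", so whichever type the inference chain tries first wins.
{code:scala}
import java.text.SimpleDateFormat
import scala.util.Try

// Simplified illustration (not Spark's CSVInferSchema): a digits-only date
// parses both as an integer and as a "yyyyMMdd" timestamp, so an inference
// chain that tries integers first never reports Timestamp for such values.
object InferenceOrderSketch {
  private val timestampFormat = new SimpleDateFormat("yyyyMMdd")

  def inferField(field: String): String =
    if (Try(field.toInt).isSuccess) "integer"                           // tried first
    else if (Try(field.toDouble).isSuccess) "double"
    else if (Try(timestampFormat.parse(field)).isSuccess) "timestamp"   // never reached for "20001122"
    else "string"

  def main(args: Array[String]): Unit = {
    println(inferField("20001122"))                             // prints "integer"
    println(Try(timestampFormat.parse("20001122")).isSuccess)   // prints "true": it is also a valid timestamp
  }
}
{code}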

Since my date consists only of digits, it is being inferred as Integer. Would it be correct to change the order in the match clause and give preference to Timestamps? I think this would not be good in terms of performance, since every integer value would first have to be tried as a timestamp, but I also think that the current implementation is not valid for dates that consist only of digits.
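
As a possible workaround in the meantime (just a sketch using the standard reader API; it assumes the explicit-schema path honours the timestampFormat option), the schema can be declared up front so that c4 is never inferred:
{code:scala}
// Sketch of a workaround, runnable in the spark-shell: declare the schema
// explicitly so inference is skipped and c4 is parsed with timestampFormat.
import org.apache.spark.sql.types._

val explicitSchema = StructType(Seq(
  StructField("c1", DoubleType),
  StructField("c2", IntegerType),
  StructField("c3", StringType),
  StructField("c4", TimestampType),
  StructField("c5", IntegerType)
))

spark.read
  .schema(explicitSchema)
  .option("header", "true")
  .option("delimiter", ";")
  .option("timestampFormat", "yyyyMMdd")
  .csv("myfile.csv")
  .printSchema
{code}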

 

 

Thanks in advance.

  was:
Hi.

I'm trying to read the following file using the spark.sql read utility:

```

c1;c2;c3;c4;c5
"+0000000.";"2";"x";"20001122";2000
"-0000010.21";"2";"x";"19991222";2000
"+0000113.34";"00";"v";"20001022";2000
"+0000000.";"0";"a";"20120322";2000

```

I'm doing this in the spark-shell using the following command:

```

 spark.sqlContext.read.option("inferSchema", "true").option("header", "true").option("delimiter", ";").option("timestampFormat","yyyyMMdd").csv("myfile.csv").printSchema

```

and I'm getting the following schema:

```

root
 |-- c1: double (nullable = true)
 |-- c2: integer (nullable = true)
 |-- c3: string (nullable = true)
 |-- c4: integer (nullable = true)
 |-- c5: integer (nullable = true)

```

As you can see, the column c4 is being inferred as Integer instead of Timestamp. I think this is due to the order used in the following match clause: 

[https://github.com/apache/spark/blob/1c9f95cb771ac78775a77edd1abfeb2d8ae2a124/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVInferSchema.scala#L87]

 

Since my date consists only of digits, it is being inferred as Integer. Would it be correct to change the order in the match clause and give preference to Timestamps? I think this is not good in terms of performance, since all the integer values would have to be tried as timestamps, but I also think that the current implementation is not valid for dates that are fully based on digits...


> Error when inferring date columns
> ---------------------------------
>
>                 Key: SPARK-23190
>                 URL: https://issues.apache.org/jira/browse/SPARK-23190
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.1.1, 2.1.2, 2.2.1
>            Reporter: Nacho García Fernández
>            Priority: Major
>
> Hi.
> I'm trying to read the following file using the spark.sql read utility:
>  
>  
> {code:java}
> c1;c2;c3;c4;c5 
> "+0000000.";"2";"x";"20001122";2000
> "-0000010.21";"2";"x";"19991222";2000 
> "+0000113.34";"00";"v";"20001022";2000 
> "+0000000.";"0";"a";"20120322";2000
> {code}
>  
> I'm doing this in the spark-shell using the following command: 
>  
> {code:java}
> spark.sqlContext.read.option("inferSchema", "true").option("header", "true").option("delimiter", ";").option("timestampFormat","yyyyMMdd").csv("myfile.csv").printSchema
> {code}
> and I'm getting the following schema:
>  
> {code:java}
> root 
>  |-- c1: double (nullable = true)
>  |-- c2: integer (nullable = true)
>  |-- c3: string (nullable = true)
>  |-- c4: integer (nullable = true)
>  |-- c5: integer (nullable = true)
> {code}
>  
> As you can see, the column c4 is being inferred as Integer instead of Timestamp. I think this is due to the order used in the following match clause: 
> [https://github.com/apache/spark/blob/1c9f95cb771ac78775a77edd1abfeb2d8ae2a124/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVInferSchema.scala#L87]
>  Since my date consists only of digits, it is being inferred as Integer. Would it be correct to change the order in the match clause and give preference to Timestamps? I think this would not be good in terms of performance, since every integer value would first have to be tried as a timestamp, but I also think that the current implementation is not valid for dates that consist only of digits.
>  
>  
> Thanks in advance.
>  
>  
>  
>  
>  
>  


