Posted to issues@spark.apache.org by "Apache Spark (JIRA)" <ji...@apache.org> on 2018/09/25 02:38:00 UTC
[jira] [Commented] (SPARK-25517) Spark DataFrame option inferSchema="true", dataFormat=MM/dd/yyyy, fails to detect date type from the csv file while reading

    [ https://issues.apache.org/jira/browse/SPARK-25517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16626683#comment-16626683 ] 

Apache Spark commented on SPARK-25517:
--------------------------------------

User 'softmanu' has created a pull request for this issue:
https://github.com/apache/spark/pull/22539

> Spark DataFrame option inferSchema="true", dataFormat=MM/dd/yyyy, fails to detect date type from the csv file while reading
> ---------------------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-25517
>                 URL: https://issues.apache.org/jira/browse/SPARK-25517
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.3.0, 2.3.1
>         Environment: Spark 2.3.0
>            Reporter: Manoranjan Kumar
>            Priority: Major
>              Labels: easyfix
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
> spark.read.format("csv").option("inferSchema", true).option("dateFormat", "MM/dd/yyyy") fails to infer the date type when reading a CSV file that has a date column in the specified format (MM/dd/yyyy).
> For example:
> An employee CSV file (employee.csv) has the following two sample records (with a header row):
> emp_id,emp_name,joining_date,emp_age,emp_in_time,emp_salary
> 100,Bradd Pitt,09/25/2018,26,09/25/2018 10:12:36,10000.00
> 101,Angel Joli,08/20/2018,28,08/20/2018 11:32:58,12000.00
> When I read the above CSV file as a DataFrame like this:
> val empDF = spark.read.format("csv").option("header", true).option("inferSchema", true).option("dateFormat", "MM/dd/yyyy").option("timestampFormat", "MM/dd/yyyy HH:mm:ss").load("employee.csv")
> empDF.printSchema()
> the output is:
> root
>  |-- emp_id: integer (nullable = true)
>  |-- emp_name: string (nullable = true)
>  |-- joining_date: string (nullable = true)
>  |-- emp_age: integer (nullable = true)
>  |-- emp_in_time: timestamp (nullable = true)
>  |-- emp_salary: double (nullable = true)
> Note the data types that Spark inferred for joining_date and emp_in_time: joining_date is not detected as a date and remains a string, whereas emp_in_time is correctly detected as a timestamp.
> I struggled with this issue for a full day. When I dug into the Spark source code, I found that schema inference is implemented for the timestamp type but is missing for the date type.
> This is my first time reporting here. Please get back to me if you need further information or a live example with running code.
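
Until date inference is available, one possible workaround is to let Spark infer the other columns and then convert joining_date explicitly. The following is only a sketch under stated assumptions: the object name, the temporary file, and the local master setting are illustrative and not part of the report, and to_date(Column, String) requires Spark 2.2 or later. It is a workaround, not the fix proposed in the pull request.

```scala
import java.nio.file.Files

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, to_date}

// Hypothetical object name, used only for this illustration.
object JoiningDateWorkaround {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("SPARK-25517-workaround")
      .master("local[*]") // assumption: local run for demonstration
      .getOrCreate()

    // Write the two sample records from the report to a temporary file.
    val csvPath = Files.createTempFile("employee", ".csv")
    val sample =
      """emp_id,emp_name,joining_date,emp_age,emp_in_time,emp_salary
        |100,Bradd Pitt,09/25/2018,26,09/25/2018 10:12:36,10000.00
        |101,Angel Joli,08/20/2018,28,08/20/2018 11:32:58,12000.00
        |""".stripMargin
    Files.write(csvPath, sample.getBytes("UTF-8"))

    // Read with schema inference; joining_date is expected to come back
    // as a string on the affected versions.
    val empDF = spark.read.format("csv")
      .option("header", true)
      .option("inferSchema", true)
      .option("timestampFormat", "MM/dd/yyyy HH:mm:ss")
      .load(csvPath.toString)

    // Convert the string column to a date using the known pattern.
    val fixedDF = empDF.withColumn(
      "joining_date",
      to_date(col("joining_date"), "MM/dd/yyyy")
    )

    fixedDF.printSchema()
    spark.stop()
  }
}
```

With this conversion, printSchema() should report joining_date as a date column while the remaining columns keep their inferred types. Values that do not match the pattern may become null or raise an error, depending on the Spark version, so the result is worth checking on real data.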


