Posted to issues@spark.apache.org by "Hyukjin Kwon (Jira)" <ji...@apache.org> on 2020/06/26 01:43:00 UTC

[jira] [Resolved] (SPARK-32025) CSV schema inference with boolean & integer

     [ https://issues.apache.org/jira/browse/SPARK-32025?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon resolved SPARK-32025.
----------------------------------
    Fix Version/s: 3.1.0
       Resolution: Fixed

Issue resolved by pull request 28896
[https://github.com/apache/spark/pull/28896]

> CSV schema inference with boolean & integer 
> --------------------------------------------
>
>                 Key: SPARK-32025
>                 URL: https://issues.apache.org/jira/browse/SPARK-32025
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.4.6
>            Reporter: Brian Wallace
>            Assignee: Pablo Langa Blanco
>            Priority: Major
>             Fix For: 3.1.0
>
>
> I have a dataset consisting of two small files in CSV format. 
> {code:bash}
> $ cat /example/f0.csv
> col1
> 8589934592
> $ cat /example/f1.csv
> col1
> 43200000
> true
> {code}
>  
> When I try to load this in (py)spark with schema inference enabled, I expect the column to be inferred as a string. However, it is inferred as a boolean:
> {code:python}
> spark.read.csv(path="file:///example/*.csv", header=True, inferSchema=True, multiLine=True).show()
> +----+
> |col1|
> +----+
> |null|
> |true|
> |null|
> +----+
> {code}
> Note that this seems to work correctly if multiLine is set to False (although we need to set it to True as this column may indeed span multiple lines in general).
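
The expectation stated in the description (a column that mixes integers and a boolean should fall back to string) can be sketched in plain Python. This is a simplified illustration only, not Spark's inference code: the function names, the reduced set of types, and the merge rules are assumptions made for the example.

```python
# Simplified sketch of per-column type inference; NOT Spark's actual code.
# All function names below are hypothetical and exist only for illustration.

INT_MIN, INT_MAX = -2**31, 2**31 - 1  # range of a 32-bit signed integer


def infer_value_type(value: str) -> str:
    """Classify one CSV cell as 'boolean', 'integer', 'long', or 'string'."""
    text = value.strip()
    if text.lower() in ("true", "false"):
        return "boolean"
    try:
        number = int(text)
    except ValueError:
        return "string"
    return "integer" if INT_MIN <= number <= INT_MAX else "long"


def merge_types(left: str, right: str) -> str:
    """Combine two inferred types into one that can hold both."""
    if left == right:
        return left
    if {left, right} == {"integer", "long"}:
        return "long"  # widen the narrower numeric type
    return "string"  # incompatible types (e.g. boolean and long) fall back to string


def infer_column_type(values) -> str:
    """Fold merge_types over every cell of a column."""
    inferred = None
    for value in values:
        current = infer_value_type(value)
        inferred = current if inferred is None else merge_types(inferred, current)
    return inferred if inferred is not None else "string"


# Values of col1 from the two files in the report (headers excluded).
f0 = ["8589934592"]
f1 = ["43200000", "true"]
print(infer_column_type(f0 + f1))  # prints "string"
```

Under these assumed rules the three values merge as long, then long, then string, which matches the result the reporter expected rather than the boolean column shown above.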


