Posted to issues@spark.apache.org by "Pablo Langa Blanco (Jira)" <ji...@apache.org> on 2020/06/20 21:28:00 UTC

[jira] [Commented] (SPARK-32025) CSV schema inference with boolean & integer

    [ https://issues.apache.org/jira/browse/SPARK-32025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17141213#comment-17141213 ] 

Pablo Langa Blanco commented on SPARK-32025:
--------------------------------------------

I'm looking into the problem. As a workaround, you can define the schema explicitly, which avoids the bug in automatic schema inference.
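
A minimal sketch of that workaround, assuming the column should be read as a string (the type the reporter expected) and that an existing SparkSession is passed in. The helper name read_with_explicit_schema and the constant EXPLICIT_SCHEMA are made up for illustration; the only Spark call used is spark.read.csv, with a DDL-formatted schema string in place of inferSchema=True:

```python
# Workaround sketch: supply an explicit schema instead of inferSchema=True.
# The column name "col1" comes from the report; STRING is an assumption
# based on the type the reporter expected for mixed integer/boolean values.
EXPLICIT_SCHEMA = "col1 STRING"  # DDL-formatted schema string


def read_with_explicit_schema(spark, path):
    """Read CSV files with a fixed schema so that no inference takes place.

    `spark` is an existing SparkSession; `path` may be a glob such as
    "file:///example/*.csv".
    """
    return spark.read.csv(
        path=path,
        header=True,             # first line of each file is the column name
        schema=EXPLICIT_SCHEMA,  # replaces inferSchema=True
        multiLine=True,          # still required by the reporter's use case
    )
```

Calling read_with_explicit_schema(spark, "file:///example/*.csv").show() should then keep every value as a string rather than turning non-boolean values into null, although I have not verified this against 2.4.6.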

> CSV schema inference with boolean & integer 
> --------------------------------------------
>
>                 Key: SPARK-32025
>                 URL: https://issues.apache.org/jira/browse/SPARK-32025
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.4.6
>            Reporter: Brian Wallace
>            Priority: Major
>
> I have a dataset consisting of two small files in CSV format. 
> {code:bash}
> $ cat /example/f0.csv
> col1
> 8589934592
> $ cat /example/f1.csv
> col1
> 43200000
> true
> {code}
>  
> When I load this in (py)spark with schema inference enabled, I expect the column to be inferred as a string. However, it is inferred as a boolean:
> {code:python}
> spark.read.csv(path="file:///example/*.csv", header=True, inferSchema=True, multiLine=True).show()
> +----+
> |col1|
> +----+
> |null|
> |true|
> |null|
> +----+
> {code}
> Note that this seems to work correctly if multiLine is set to False (although we need to set it to True as this column may indeed span multiple lines in general).


