Posted to issues@spark.apache.org by "zzzzming95 (Jira)" <ji...@apache.org> on 2022/10/22 09:40:00 UTC
[jira] [Comment Edited] (SPARK-40808) Infer schema for CSV files - wrong behavior using header + merge schema

    [ https://issues.apache.org/jira/browse/SPARK-40808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17622594#comment-17622594 ] 

zzzzming95 edited comment on SPARK-40808 at 10/22/22 9:39 AM:
--------------------------------------------------------------

[~ohadm] 

In Spark, CSV schema inference skips the first line when the header option is set to true. However, only one file's header is treated as the real first line. If two files have different headers, the header of the other file is therefore read as data during schema inference.
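
To make the effect concrete, here is a toy model in plain Python. It is not Spark's implementation, only a sketch under the assumption described above: one header line is taken as the header, lines identical to it are dropped, and every other line (including a different header from another file) is used as data. The names infer_type and infer_schema are hypothetical helpers, and the file contents are copied from the issue description.

```python
import csv
import io

# File contents copied from the issue description.
FILE1 = '''"int_col","string_col","decimal_col","date_col"
1,"hello",1.43,2022-02-23
2,"world",5.534,2021-05-05
3,"my name",86.455,2011-08-15
4,"is ohad",6.234,2002-03-22
'''

FILE2 = '''"int_col","string_col","decimal_col","date_col","int2_col"
1,"hello",1.43,2022-02-23,234
2,"world",5.534,2021-05-05,5
3,"my name",86.455,2011-08-15,32
4,"is ohad",6.234,2002-03-22,2
'''


def infer_type(values):
    """Return the narrowest of int, double, string that fits every value."""
    def fits(cast):
        try:
            for value in values:
                cast(value)
            return True
        except ValueError:
            return False

    if fits(int):
        return "int"
    if fits(float):
        return "double"
    return "string"


def infer_schema(files):
    """Toy inference: the first line of the first file is the only header."""
    rows = [row for text in files for row in csv.reader(io.StringIO(text))]
    header = rows[0]
    # Assumption: only lines identical to the chosen header are skipped,
    # so a different header from another file stays in the data.
    data = [row for row in rows if row != header]
    schema = {}
    for index, name in enumerate(header):
        # Short rows simply contribute no value for the missing columns.
        values = [row[index] for row in data if index < len(row)]
        schema[name] = infer_type(values)
    return schema


if __name__ == "__main__":
    print(infer_schema([FILE2]))         # second file alone
    print(infer_schema([FILE2, FILE1]))  # both files, File2's header chosen
```

In this model, reading both files with File2's header chosen turns File1's header line into a data row, so the first four columns each contain a non-numeric value and fall back to string, while int2_col only ever sees integers.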

 

In this case, you can give all files the same header so that unit test4 passes.
{code:java}
//file2.csv
"int_col","string_col","double_col","int2_col"
12,"hello2",1.432
22,"world2",5.5342
32,"my name2",86.4552
42,"is ohad2",6.2342 {code}
 

When reading a CSV directory, it is reasonable to assume by default that all files in the directory have the same schema. If there are no further questions, I will mark this issue as resolved.
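
If it helps, that assumption can be checked before handing a directory to Spark. The following is a minimal sketch in plain Python, not part of Spark. The helper names read_header and group_by_header are hypothetical, and it assumes the files end in .csv and sit directly in the directory.

```python
import csv
from pathlib import Path


def read_header(path):
    """Return the first CSV record of a file, or an empty list if the file is empty."""
    with open(path, newline="") as handle:
        return next(csv.reader(handle), [])


def group_by_header(directory):
    """Map each distinct header (as a tuple) to the names of the files that use it."""
    groups = {}
    for path in sorted(Path(directory).glob("*.csv")):
        groups.setdefault(tuple(read_header(path)), []).append(path.name)
    return groups
```

More than one key in the returned mapping means the files do not share a header, and each group could then be read separately or the headers aligned first.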



> Infer schema for CSV files - wrong behavior using header + merge schema
> -----------------------------------------------------------------------
>
>                 Key: SPARK-40808
>                 URL: https://issues.apache.org/jira/browse/SPARK-40808
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.2.2
>            Reporter: ohad
>            Priority: Major
>              Labels: CSVReader, csv, csvparser
>         Attachments: test_csv.py
>
>
> Hello. 
> I am writing unit tests for some functionality in my application that reads data from CSV files using Spark.
> I am reading the data using:
> {code:java}
> header=True
> mergeSchema=True
> inferSchema=True{code}
> When I am reading this single file:
> {code:java}
> File1:
> "int_col","string_col","decimal_col","date_col"
> 1,"hello",1.43,2022-02-23
> 2,"world",5.534,2021-05-05
> 3,"my name",86.455,2011-08-15
> 4,"is ohad",6.234,2002-03-22{code}
> I am getting this schema:
> {code:java}
> int_col=int
> string_col=string
> decimal_col=double
> date_col=string{code}
> When I duplicate this file, I get the same schema.
> The strange part is that when I add a new int column, Spark seems to get confused and treats the columns that were already identified as int and double as string:
> {code:java}
> File1:
> "int_col","string_col","decimal_col","date_col"
> 1,"hello",1.43,2022-02-23
> 2,"world",5.534,2021-05-05
> 3,"my name",86.455,2011-08-15
> 4,"is ohad",6.234,2002-03-22
> File2:
> "int_col","string_col","decimal_col","date_col","int2_col"
> 1,"hello",1.43,2022-02-23,234
> 2,"world",5.534,2021-05-05,5
> 3,"my name",86.455,2011-08-15,32
> 4,"is ohad",6.234,2002-03-22,2
> {code}
> result:
> {code:java}
> int_col=string
> string_col=string
> decimal_col=string
> date_col=string
> int2_col=int{code}
> When I am reading only the second file, it looks fine:
> {code:java}
> File2:
> "int_col","string_col","decimal_col","date_col","int2_col"
> 1,"hello",1.43,2022-02-23,234
> 2,"world",5.534,2021-05-05,5
> 3,"my name",86.455,2011-08-15,32
> 4,"is ohad",6.234,2002-03-22,2{code}
> result:
> {code:java}
> int_col=int
> string_col=string
> decimal_col=double
> date_col=string
> int2_col=int{code}
> In conclusion, it looks like there is a bug when combining the two features: header recognition and merge schema.


