Posted to issues@spark.apache.org by "ohad (Jira)" <ji...@apache.org> on 2022/10/16 08:25:00 UTC
[jira] [Created] (SPARK-40808) Infer schema for CSV files - wrong behavior using header + merge schema
ohad created SPARK-40808:
----------------------------
Summary: Infer schema for CSV files - wrong behavior using header + merge schema
Key: SPARK-40808
URL: https://issues.apache.org/jira/browse/SPARK-40808
Project: Spark
Issue Type: Bug
Components: Spark Core
Affects Versions: 3.2.2
Reporter: ohad
Hello.
I am writing unit tests for functionality in my application that reads data from CSV files using Spark.
I am reading the data with these options:
```
header=True
mergeSchema=True
inferSchema=True
```
When I am reading this single file:
```
File1:
"int_col","string_col","decimal_col","date_col"
1,"hello",1.43,2022-02-23
2,"world",5.534,2021-05-05
3,"my name",86.455,2011-08-15
4,"is ohad",6.234,2002-03-22
```
I am getting this schema:
```
int_col=int
string_col=string
decimal_col=double
date_col=string
```
When I duplicate this file and read both copies, I get the same schema.
The strange part is that when I add a new int column to one of the files, Spark seems to get confused and decides that the columns it previously identified as int and double are now string:
```
File1:
"int_col","string_col","decimal_col","date_col"
1,"hello",1.43,2022-02-23
2,"world",5.534,2021-05-05
3,"my name",86.455,2011-08-15
4,"is ohad",6.234,2002-03-22
File2:
"int_col","string_col","decimal_col","date_col","int2_col"
1,"hello",1.43,2022-02-23,234
2,"world",5.534,2021-05-05,5
3,"my name",86.455,2011-08-15,32
4,"is ohad",6.234,2002-03-22,2
```
result:
```
int_col=string
string_col=string
decimal_col=string
date_col=string
int2_col=int
```
When I read only the second file, the schema looks fine:
```
File2:
"int_col","string_col","decimal_col","date_col","int2_col"
1,"hello",1.43,2022-02-23,234
2,"world",5.534,2021-05-05,5
3,"my name",86.455,2011-08-15,32
4,"is ohad",6.234,2002-03-22,2
```
result:
```
int_col=int
string_col=string
decimal_col=double
date_col=string
int2_col=int
```
In conclusion, it looks like there is a bug when two features are combined: header recognition and schema merging.
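
The steps above can be sketched as a small PySpark script. This is only a minimal reproduction sketch, not the original test code: the helper names (`write_fixture`, `infer_schema`, `reproduce`), the file names, and the local single-threaded session are assumptions made for illustration. The three options are passed exactly as listed in the report; whether the CSV reader actually honors `mergeSchema` is not confirmed here.

```python
import os
import tempfile

# Sample data copied from the report. File2 adds a fifth column, int2_col.
FILE1 = (
    '"int_col","string_col","decimal_col","date_col"\n'
    '1,"hello",1.43,2022-02-23\n'
    '2,"world",5.534,2021-05-05\n'
    '3,"my name",86.455,2011-08-15\n'
    '4,"is ohad",6.234,2002-03-22\n'
)
FILE2 = (
    '"int_col","string_col","decimal_col","date_col","int2_col"\n'
    '1,"hello",1.43,2022-02-23,234\n'
    '2,"world",5.534,2021-05-05,5\n'
    '3,"my name",86.455,2011-08-15,32\n'
    '4,"is ohad",6.234,2002-03-22,2\n'
)


def write_fixture(directory):
    """Write both sample files into `directory` and return their paths."""
    paths = []
    for name, content in (("file1.csv", FILE1), ("file2.csv", FILE2)):
        path = os.path.join(directory, name)
        with open(path, "w", newline="") as handle:
            handle.write(content)
        paths.append(path)
    return tuple(paths)


def infer_schema(spark, paths):
    """Read the CSV files with the reported options; return {column: type}."""
    df = spark.read.options(
        header=True, inferSchema=True, mergeSchema=True
    ).csv(list(paths))
    return {f.name: f.dataType.simpleString() for f in df.schema.fields}


def reproduce():
    """Print the inferred schema for each of the three cases in the report."""
    # Imported here so the fixture helper can be used without Spark installed.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder.master("local[1]")
        .appName("SPARK-40808-repro")
        .getOrCreate()
    )
    try:
        with tempfile.TemporaryDirectory() as tmp:
            file1, file2 = write_fixture(tmp)
            cases = (
                ("file1 only", [file1]),
                ("file2 only", [file2]),
                ("file1 + file2", [file1, file2]),
            )
            for label, paths in cases:
                print(label, infer_schema(spark, paths))
    finally:
        spark.stop()
```

Calling `reproduce()` in an environment with PySpark installed prints the three inferred schemas, which can then be compared against the results listed above.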