
Posted to issues@spark.apache.org by "ohad (Jira)" <ji...@apache.org> on 2022/10/16 08:25:00 UTC

[jira] [Created] (SPARK-40808) Infer schema for CSV files - wrong behavior using header + merge schema

ohad created SPARK-40808:
----------------------------

             Summary: Infer schema for CSV files - wrong behavior using header + merge schema
                 Key: SPARK-40808
                 URL: https://issues.apache.org/jira/browse/SPARK-40808
             Project: Spark
          Issue Type: Bug
          Components: Spark Core
    Affects Versions: 3.2.2
            Reporter: ohad


Hello. 
I am writing unit tests for functionality in my application that reads data from CSV files using Spark.

I am reading the data using:
```
header=True
mergeSchema=True
inferSchema=True
```

When I read this single file:
```
File1:
"int_col","string_col","decimal_col","date_col"
1,"hello",1.43,2022-02-23
2,"world",5.534,2021-05-05
3,"my name",86.455,2011-08-15
4,"is ohad",6.234,2002-03-22
```

I get this schema:
```
int_col=int
string_col=string
decimal_col=double
date_col=string
```

When I duplicate this file and read both copies, I get the same schema.

The strange part is that when I add a new int column to one of the files, Spark seems to get confused and treats the columns it previously identified as int and double as string:
```
File1:
"int_col","string_col","decimal_col","date_col"
1,"hello",1.43,2022-02-23
2,"world",5.534,2021-05-05
3,"my name",86.455,2011-08-15
4,"is ohad",6.234,2002-03-22

File2:
"int_col","string_col","decimal_col","date_col","int2_col"
1,"hello",1.43,2022-02-23,234
2,"world",5.534,2021-05-05,5
3,"my name",86.455,2011-08-15,32
4,"is ohad",6.234,2002-03-22,2
```

result:
```
int_col=string
string_col=string
decimal_col=string
date_col=string
int2_col=int
```

When I read only the second file, the schema looks correct:
```
File2:
"int_col","string_col","decimal_col","date_col","int2_col"
1,"hello",1.43,2022-02-23,234
2,"world",5.534,2021-05-05,5
3,"my name",86.455,2011-08-15,32
4,"is ohad",6.234,2002-03-22,2
```

result:
```
int_col=int
string_col=string
decimal_col=double
date_col=string
int2_col=int
```

In conclusion, it looks like there is a bug when two features are combined: header recognition and merge schema.
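
One hypothesis that would be consistent with the schemas above is that only lines identical to the first header line are skipped during inference, so a differing header from another file is treated as a data row. The sketch below is a pure-Python simulation of that assumption, not Spark code; the function names and the simplified int/double/string type ladder are illustrative only.

```python
import csv

FILE1 = '''"int_col","string_col","decimal_col","date_col"
1,"hello",1.43,2022-02-23
2,"world",5.534,2021-05-05
3,"my name",86.455,2011-08-15
4,"is ohad",6.234,2002-03-22
'''

FILE2 = '''"int_col","string_col","decimal_col","date_col","int2_col"
1,"hello",1.43,2022-02-23,234
2,"world",5.534,2021-05-05,5
3,"my name",86.455,2011-08-15,32
4,"is ohad",6.234,2002-03-22,2
'''

# Simplified type ladder: a column widens from int to double to string.
RANK = {"int": 0, "double": 1, "string": 2}


def infer_cell(value):
    """Classify one raw cell as 'int', 'double', or 'string'."""
    for caster, name in ((int, "int"), (float, "double")):
        try:
            caster(value)
            return name
        except ValueError:
            pass
    return "string"


def simulate_inference(texts):
    """Infer column types, skipping only lines equal to the first header."""
    lines = [ln for text in texts for ln in text.splitlines() if ln.strip()]
    header_line = lines[0]
    names = next(csv.reader([header_line]))
    types = [None] * len(names)
    for line in lines[1:]:
        if line == header_line:
            continue  # an identical header from another file is dropped
        row = next(csv.reader([line]))
        # A header that differs reaches this point and is typed as data.
        for i, cell in enumerate(row[:len(names)]):
            cell_type = infer_cell(cell)
            if types[i] is None or RANK[cell_type] > RANK[types[i]]:
                types[i] = cell_type
    return {name: t or "string" for name, t in zip(names, types)}


print(simulate_inference([FILE1, FILE1]))  # duplicated file
print(simulate_inference([FILE2]))         # second file only
print(simulate_inference([FILE2, FILE1]))  # both files, File2 read first
```

Under this assumption the simulation matches all three reported schemas, but the mixed case matches only when the five-column file is read first. I have not confirmed that this is what Spark actually does.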


