Posted to issues@spark.apache.org by "Sean Owen (JIRA)" <ji...@apache.org> on 2017/03/24 07:46:41 UTC

[jira] [Resolved] (SPARK-20012) spark.read.csv schemas effectively ignore headers

     [ https://issues.apache.org/jira/browse/SPARK-20012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen resolved SPARK-20012.
-------------------------------
    Resolution: Not A Problem

> spark.read.csv schemas effectively ignore headers
> -------------------------------------------------
>
>                 Key: SPARK-20012
>                 URL: https://issues.apache.org/jira/browse/SPARK-20012
>             Project: Spark
>          Issue Type: Bug
>          Components: Input/Output
>    Affects Versions: 2.1.0
>         Environment: pyspark
>            Reporter: david cottrell
>            Priority: Minor
>
> New to Spark, so please direct me elsewhere if there is another place for this kind of discussion.
> To my understanding, schemas are ordered *named* structures; however, it seems the names are not used when reading files with headers.
> I had a quick look at the DataFrameReader code and it seems like it might not be too hard to
> a) let the schema set the "global" order of the columns
> b) per file, map the columns *by name* to the schema ordering and apply the types on load.
> A simple way of saying this is that the schema is an ordered dictionary, while files with headers define only (unordered) dictionaries. A sketch of a user-side workaround along these lines appears after the example below.
> A typical example showing what I think are the implications of this problem: 
> {code}
> In [248]: a = spark.read.csv('./data/test.csv.gz', header=True, inferSchema=True).toPandas()
> In [249]: b = spark.read.csv('./data/0.csv.gz', header=True, inferSchema=True).toPandas()
> In [250]: d = pd.concat([a, b])
> In [251]: df = spark.read.csv('./data/{test,0}.csv.gz', header=True, inferSchema=True).toPandas()
> In [252]: df[['b', 'c', 'd', 'e']] = df[['b', 'c', 'd', 'e']].astype(float)
> In [253]: a
> Out[253]:
>       a         b         e         d         c
> 0  test -0.874197  0.168660 -0.948726  0.479723
> 1  test  1.124383  0.620870  0.159186  0.993676
> 2  test -1.429108 -0.048814 -0.057273 -1.331702
> In [254]: b
> Out[254]:
>    a         b         c         d         e
> 0  0 -1.671828 -1.259530  0.905029  0.487244
> 1  0 -0.024553 -1.750904  0.004466  1.978049
> 2  0  1.686806  0.175431  0.677609 -0.851670
> In [255]: d
> Out[255]:
>       a         b         c         d         e
> 0  test -0.874197  0.479723 -0.948726  0.168660
> 1  test  1.124383  0.993676  0.159186  0.620870
> 2  test -1.429108 -1.331702 -0.057273 -0.048814
> 0     0 -1.671828 -1.259530  0.905029  0.487244
> 1     0 -0.024553 -1.750904  0.004466  1.978049
> 2     0  1.686806  0.175431  0.677609 -0.851670
> In [256]: df
> Out[256]:
>       a         b         c         d         e
> 0  test -0.874197  0.168660 -0.948726  0.479723
> 1  test  1.124383  0.620870  0.159186  0.993676
> 2  test -1.429108 -0.048814 -0.057273 -1.331702
> 3     0 -1.671828 -1.259530  0.905029  0.487244
> 4     0 -0.024553 -1.750904  0.004466  1.978049
> 5     0  1.686806  0.175431  0.677609 -0.851670
> {code}
> Example also posted here: http://stackoverflow.com/questions/42637497/pyspark-2-1-0-spark-read-csv-scrambles-columns
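> As a user-side workaround in the meantime, (b) can be approximated by reading each file separately and aligning the columns *by name* to an explicit schema before taking the union. A minimal sketch (PySpark; the schema, paths, and types below are assumptions matching the example above, not Spark behavior; note that union is positional, which is why the per-file select is needed):
> {code}
> from functools import reduce
> from pyspark.sql import functions as F
> from pyspark.sql.types import StructType, StructField, StringType, DoubleType
>
> # Illustrative target schema: 'a' as string, 'b'..'e' as doubles.
> schema = StructType([StructField('a', StringType())] +
>                     [StructField(c, DoubleType()) for c in 'bcde'])
>
> paths = ['./data/test.csv.gz', './data/0.csv.gz']
>
> # Read each file with its own header, then reorder and cast the columns
> # *by name* to match the schema's field order before the union.
> parts = [spark.read.csv(p, header=True) for p in paths]
> aligned = [part.select([F.col(f.name).cast(f.dataType) for f in schema.fields])
>            for part in parts]
> df = reduce(lambda x, y: x.union(y), aligned)
> {code}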



