Posted to issues@spark.apache.org by "Sean Owen (JIRA)" <ji...@apache.org> on 2017/03/24 07:46:41 UTC
[jira] [Resolved] (SPARK-20012) spark.read.csv schemas effectively ignore headers
[ https://issues.apache.org/jira/browse/SPARK-20012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sean Owen resolved SPARK-20012.
-------------------------------
Resolution: Not A Problem
> spark.read.csv schemas effectively ignore headers
> -------------------------------------------------
>
> Key: SPARK-20012
> URL: https://issues.apache.org/jira/browse/SPARK-20012
> Project: Spark
> Issue Type: Bug
> Components: Input/Output
> Affects Versions: 2.1.0
> Environment: pyspark
> Reporter: david cottrell
> Priority: Minor
>
> New to Spark, so please direct me elsewhere if there is another place for this kind of discussion.
> To my understanding, schemas are ordered, *named* structures; however, it seems the names are not used when reading files with headers.
> I had a quick look at the DataFrameReader code and it seems like it might not be too hard to:
> a) let the schema set the "global" order of the columns
> b) per file, map the columns *by name* to the schema ordering and apply the types on load.
> A simple way of saying this is that the schema is an ordered dictionary, whereas files with headers only define plain dictionaries whose column order should not matter.
> A typical example showing what I think are the implications of this problem:
> {code}
> In [248]: a = spark.read.csv('./data/test.csv.gz', header=True, inferSchema=True).toPandas()
> In [249]: b = spark.read.csv('./data/0.csv.gz', header=True, inferSchema=True).toPandas()
> In [250]: d = pd.concat([a, b])
> In [251]: df = spark.read.csv('./data/{test,0}.csv.gz', header=True, inferSchema=True).toPandas()
> In [252]: df[['b', 'c', 'd', 'e']] = df[['b', 'c', 'd', 'e']].astype(float)
> In [253]: a
> Out[253]:
>       a         b         e         d         c
> 0  test -0.874197  0.168660 -0.948726  0.479723
> 1  test  1.124383  0.620870  0.159186  0.993676
> 2  test -1.429108 -0.048814 -0.057273 -1.331702
> In [254]: b
> Out[254]:
>    a         b         c         d         e
> 0  0 -1.671828 -1.259530  0.905029  0.487244
> 1  0 -0.024553 -1.750904  0.004466  1.978049
> 2  0  1.686806  0.175431  0.677609 -0.851670
> In [255]: d
> Out[255]:
>       a         b         c         d         e
> 0  test -0.874197  0.479723 -0.948726  0.168660
> 1  test  1.124383  0.993676  0.159186  0.620870
> 2  test -1.429108 -1.331702 -0.057273 -0.048814
> 0     0 -1.671828 -1.259530  0.905029  0.487244
> 1     0 -0.024553 -1.750904  0.004466  1.978049
> 2     0  1.686806  0.175431  0.677609 -0.851670
> In [256]: df
> Out[256]:
>       a         b         c         d         e
> 0  test -0.874197  0.168660 -0.948726  0.479723
> 1  test  1.124383  0.620870  0.159186  0.993676
> 2  test -1.429108 -0.048814 -0.057273 -1.331702
> 3     0 -1.671828 -1.259530  0.905029  0.487244
> 4     0 -0.024553 -1.750904  0.004466  1.978049
> 5     0  1.686806  0.175431  0.677609 -0.851670
> {code}
> Example also posted here: http://stackoverflow.com/questions/42637497/pyspark-2-1-0-spark-read-csv-scrambles-columns
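The by-name mapping proposed in points a) and b) of the description can be illustrated outside Spark. The following is a minimal sketch in plain Python using only the standard library; it is not how DataFrameReader works, and the names `SCHEMA`, `read_by_name`, `FILE_1` and `FILE_2`, as well as the sample data, are hypothetical. The schema is treated as an ordered dictionary that fixes the global column order and types, and each file's header is used only to look columns up by name.

```python
import csv
import io
from collections import OrderedDict

# Hypothetical schema: column name -> type, in the desired "global" order.
SCHEMA = OrderedDict([("a", str), ("b", float), ("c", float)])


def read_by_name(text, schema):
    """Yield rows as tuples ordered by `schema`, matching columns by header name."""
    reader = csv.DictReader(io.StringIO(text))
    missing = set(schema) - set(reader.fieldnames or [])
    if missing:
        raise ValueError("missing columns: %s" % sorted(missing))
    for record in reader:
        # The header decides where each value comes from;
        # the schema decides the output order and the type.
        yield tuple(cast(record[name]) for name, cast in schema.items())


# Two hypothetical files with the same columns in a different header order.
FILE_1 = "a,c,b\ntest,0.5,-1.0\n"
FILE_2 = "a,b,c\nzero,2.0,3.0\n"

rows = [row for text in (FILE_1, FILE_2) for row in read_by_name(text, SCHEMA)]
print(rows)
```

Both files end up in the schema's order (a, b, c) regardless of how their headers are arranged. In Spark, a comparable effect could presumably be had by reading each file separately and selecting the columns by name before combining the results, rather than reading all files in one call.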