
Posted to issues@spark.apache.org by "Apache Spark (JIRA)" <ji...@apache.org> on 2016/08/22 04:50:22 UTC

[jira] [Assigned] (SPARK-16896) Loading csv with duplicate column names

     [ https://issues.apache.org/jira/browse/SPARK-16896?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-16896:
------------------------------------

    Assignee:     (was: Apache Spark)

> Loading csv with duplicate column names
> ---------------------------------------
>
>                 Key: SPARK-16896
>                 URL: https://issues.apache.org/jira/browse/SPARK-16896
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.0.0
>            Reporter: Aseem Bansal
>
> It would be great if the library allowed us to load CSV files with duplicate column names. I understand that having duplicate columns in the data is odd, but upstream data sometimes arrives that way. We may choose to ignore the duplicates, but currently there is no way to drop them because the file cannot be loaded at all. As a pre-processing workaround, I currently load the data into R, change the column names, and write out a fixed version that the Spark Java API can work with.
> Other tools already handle this: for example, R's read.csv automatically takes care of this situation by appending a number to the column name.
> Case sensitivity in column names can also cause problems. If we have columns like
> ColumnName, columnName
> I may want to keep them as separate columns, but the option to do this is not documented.
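
For illustration only, here is a minimal sketch of the renaming behaviour described above: repeated header names get a numeric suffix, and a flag controls whether names that differ only by case count as duplicates. The function name `dedupe_column_names`, the `case_sensitive` flag, and the "." separator are assumptions made for this example. It is not Spark's API and not necessarily how any fix for this issue would behave.

```python
from collections import Counter


def dedupe_column_names(names, case_sensitive=True, sep="."):
    """Return a copy of `names` in which repeated names are made unique.

    The first occurrence of a name is kept unchanged; later occurrences
    get `sep` plus a counter appended (value, value.1, value.2, ...).
    Hypothetical helper for illustration, not part of any library.
    """
    def normalize(name):
        # With case_sensitive=False, "ColumnName" and "columnName" collide.
        return name if case_sensitive else name.lower()

    counters = Counter()   # how many suffixes were tried per base name
    taken = set()          # normalized names already emitted
    result = []
    for name in names:
        candidate = name
        # Keep incrementing the suffix until the candidate is unused,
        # so a generated name never clashes with an existing column.
        while normalize(candidate) in taken:
            counters[normalize(name)] += 1
            candidate = f"{name}{sep}{counters[normalize(name)]}"
        taken.add(normalize(candidate))
        result.append(candidate)
    return result


if __name__ == "__main__":
    print(dedupe_column_names(["id", "value", "value", "value"]))
    print(dedupe_column_names(["ColumnName", "columnName"], case_sensitive=False))
```

The header row would have to be read separately and the resulting names applied by whatever loader is in use; that wiring is not shown because it depends on the reader's options.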



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
