Posted to issues@spark.apache.org by "koert kuipers (JIRA)" <ji...@apache.org> on 2015/07/03 20:32:04 UTC

[jira] [Created] (SPARK-8817) DataFrame should not allow duplicate column names

koert kuipers created SPARK-8817:
------------------------------------

             Summary: DataFrame should not allow duplicate column names
                 Key: SPARK-8817
                 URL: https://issues.apache.org/jira/browse/SPARK-8817
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 1.4.0
            Reporter: koert kuipers
            Priority: Minor


pull 2209 (https://github.com/apache/spark/pull/2209) for SPARK-2890 disabled field name validation (which checks for duplicate column names) in StructType, in favor of throwing an error in SQL query analysis.

the problem with this is that it is not intuitive for a DataFrame to have duplicate column names, and not all usage of DataFrame involves SQL queries.

by removing the check from StructType and hence from DataFrame, it becomes the responsibility of the DSLs that are built on top of DataFrame to do these checks, which is more burdensome and can lead to subtle errors. i ran into this while writing an alternative DSL for DataFrame.
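
To illustrate the kind of check each DSL would now have to carry itself, here is a minimal sketch in plain Python. It is not Spark code: the function names find_duplicate_names and validate_field_names are hypothetical, and the schema is assumed to be just a list of column name strings.

```python
from collections import Counter
from typing import List


def find_duplicate_names(names: List[str]) -> List[str]:
    """Return the names that occur more than once, in first-seen order."""
    counts = Counter(names)
    # dict.fromkeys drops repeats while keeping the original order
    return [name for name in dict.fromkeys(names) if counts[name] > 1]


def validate_field_names(names: List[str]) -> None:
    """Hypothetical stand-in for a schema-level check: reject duplicates."""
    dupes = find_duplicate_names(names)
    if dupes:
        raise ValueError("duplicate column names: " + ", ".join(dupes))
```

Doing this once at the schema level would mean every layer on top gets the guarantee for free, instead of each DSL repeating it.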

In R duplicate columns get automatically renamed:
> data.frame(x = c(1,2), x = c(3,4))
  x x.1
1 1   3
2 2   4

i believe pandas does allow duplicate names, but i am not sure (never used it).

maybe StructType.validateFields can do something similar to what R does and simply rename the dupes?
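
A rough sketch of what such renaming could look like, written in plain Python rather than against Spark. The function name rename_duplicates is hypothetical, and the ".1", ".2" suffix scheme is modeled on the R output above; it is not guaranteed to match R in every edge case.

```python
from typing import Dict, List, Set


def rename_duplicates(names: List[str]) -> List[str]:
    """Keep the first occurrence of each name; suffix later ones with .1, .2, ..."""
    reserved: Set[str] = set(names)   # original names a generated name must not reuse
    used: Set[str] = set()            # names already emitted
    counters: Dict[str, int] = {}     # last suffix tried per base name
    result: List[str] = []
    for name in names:
        if name not in used:
            new_name = name
        else:
            n = counters.get(name, 0) + 1
            new_name = f"{name}.{n}"
            # skip suffixes that would collide with another column
            while new_name in used or new_name in reserved:
                n += 1
                new_name = f"{name}.{n}"
            counters[name] = n
        used.add(new_name)
        result.append(new_name)
    return result
```

For the two-column case from the R example, rename_duplicates(["x", "x"]) gives ["x", "x.1"]. Columns that are already unique are left untouched.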
