Posted to dev@spark.apache.org by Jeff Zhang <zj...@gmail.com> on 2015/10/15 04:26:30 UTC

Should enforce the uniqueness of field name in DataFrame ?

Currently it seems that DataFrame doesn't enforce the uniqueness of field
names, so it is possible to have duplicate fields in a DataFrame. This usually
happens after a join, especially a self-join. Although the user can rename the
columns before or after the join, DataFrame#withColumnRenamed is not
sufficient for this for now. In Hive, an ambiguous name can be resolved by
using the table name as a prefix, but it seems DataFrame doesn't support this
(I mean the DataFrame API rather than Spark SQL). I think we have 2 options here:
1. Enforce the uniqueness of field names in DataFrame, so that subsequent
operations cannot cause ambiguous column references.
2. Provide DataFrame#withColumnsRenamed(oldColumns: Seq[String],
newColumns: Seq[String]) to allow changing schema names.

For now, I would prefer option 2, which is easier to implement and keeps
compatibility.


val df = ...        // schema (name, age)
val df2 = df.join(df, "name")   // schema (name, age, age)
df2.select("age")   // ambiguous column reference.

-- 
Best Regards

Jeff Zhang
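
As a possible workaround for the self-join case above, columns can be renamed by position rather than by name. The following is only a sketch: it assumes Spark 2.x or later (SparkSession did not exist when this thread was written), a local master, and made-up sample data; the names age_left and age_right are illustrative.

```scala
import org.apache.spark.sql.SparkSession

object RenameAfterSelfJoin {
  def main(args: Array[String]): Unit = {
    // Assumption: Spark 2.x+ running locally; sample rows are invented.
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("rename-after-self-join")
      .getOrCreate()
    import spark.implicits._

    val df = Seq(("alice", 30), ("bob", 25)).toDF("name", "age")
    val df2 = df.join(df, "name") // schema (name, age, age)

    // toDF assigns new names by position, so both "age" columns get distinct names.
    val renamed = df2.toDF("name", "age_left", "age_right")
    renamed.select("age_left", "age_right").show()

    spark.stop()
  }
}
```

Renaming by position sidesteps the problem that a name-based rename cannot distinguish the two "age" columns.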

Re: Should enforce the uniqueness of field name in DataFrame ?

Posted by Michael Armbrust <mi...@databricks.com>.
>
> In Hive, an ambiguous name can be resolved by using the table name as a
> prefix, but it seems DataFrame doesn't support this (I mean the DataFrame
> API rather than Spark SQL).


You can do the same using pure DataFrames.

Seq((1,2)).toDF("a", "b").registerTempTable("y")
Seq((1,4)).toDF("a", "b").registerTempTable("x")

table("x").join(table("y"), $"x.a" === $"y.a").select("y.b", "x.b").show()
+-+-+
|b|b|
+-+-+
|2|4|
+-+-+
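
The same kind of qualification can probably be had without registering temp tables by aliasing each side of the join. This is a sketch under assumptions: Spark 2.x or later, a local master, invented sample data, and the aliases l and r chosen purely for illustration.

```scala
import org.apache.spark.sql.SparkSession

object AliasSelfJoin {
  def main(args: Array[String]): Unit = {
    // Assumption: Spark 2.x+ running locally; sample rows are invented.
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("alias-self-join")
      .getOrCreate()
    import spark.implicits._

    val df = Seq(("alice", 30), ("bob", 25)).toDF("name", "age")

    // Alias each side so that the duplicated column can be qualified.
    val joined = df.as("l").join(df.as("r"), $"l.name" === $"r.name")

    // Qualified references are unambiguous; rename them on the way out.
    val result = joined.select(
      $"l.name",
      $"l.age".as("age_left"),
      $"r.age".as("age_right"))
    result.show()

    spark.stop()
  }
}
```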

> DataFrame did check for duplicate column names until Sep 2014, but then the
> check got pushed into the SQL planner making DataFrame standalone (so
> without SQL) less useful as an API.


The check in question was removed because it made it impossible to even
reason about a schema that had duplicate column names.  In general, it
seems restrictive to throw an error if duplicate column names exist in an
intermediate schema even when they aren't referenced ambiguously.  We could
consider adding an option to throw an error during analysis for this case,
but it certainly shouldn't be in the constructor of StructType.  My guess
is an option to rename as Reynold suggests would be more popular (though
this could probably not be the default without breaking things).
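
To illustrate the idea of failing only on an ambiguous reference rather than on the mere presence of duplicates, here is a minimal pure-Scala sketch. It models a schema as a plain list of names; the object name, function name, and message wording are hypothetical and are not Spark's analyzer.

```scala
object ColumnResolution {
  /** Resolve `name` against a schema that may contain duplicate names.
    * Returns the column position, or an error message when the name is
    * missing or matches more than one column. */
  def resolve(schema: Seq[String], name: String): Either[String, Int] = {
    val matches = schema.zipWithIndex.collect { case (n, i) if n == name => i }
    matches match {
      case Seq(i) => Right(i)
      case Seq()  => Left(s"Cannot resolve '$name' given columns: ${schema.mkString(", ")}")
      case many   => Left(s"Reference '$name' is ambiguous, could be positions: ${many.mkString(", ")}")
    }
  }

  def main(args: Array[String]): Unit = {
    val schema = Seq("name", "age", "age") // duplicates are allowed to exist
    println(resolve(schema, "name"))       // unique name resolves
    println(resolve(schema, "age"))        // only this reference is rejected
  }
}
```

The schema itself is never rejected; only the lookup of "age" produces an error.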

Another option that seems nice to me is to always add default qualifiers of
left/right when doing a join.  So you could always do:

df.join(df).where("left.a = right.a")

Even when you didn't manually specify left/right.  This could be done only
when there is not a qualifier already called left or right.

Re: Should enforce the uniqueness of field name in DataFrame ?

Posted by Koert Kuipers <ko...@tresata.com>.
if DataFrame aspires to be more than a vehicle for SQL then i think it
would be a mistake to allow duplicate column names. it is very confusing.
pandas indeed allows this and it has led to many bugs. R does not allow it
for data.frame (it renames the duplicates).

i would consider a csv with duplicate column names invalid and it should
not be loaded, or if it is loaded dupes should be renamed (e.g. append a
"1" to the name).
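
a rough sketch of such a renaming, in plain Scala on a list of column names. the suffix scheme (append a counter until the name is unused) is just one assumption about how the dupes could be renamed, and the object and function names are made up.

```scala
object DedupeNames {
  /** Make names unique by appending a counter to repeats,
    * skipping any candidate that is already taken. */
  def dedupe(names: Seq[String]): Seq[String] = {
    val used = scala.collection.mutable.Set.empty[String]
    names.map { n =>
      var candidate = n
      var i = 1
      // Keep trying n1, n2, ... until the candidate has not been used yet.
      while (used.contains(candidate)) {
        candidate = s"$n$i"
        i += 1
      }
      used += candidate
      candidate
    }
  }

  def main(args: Array[String]): Unit = {
    println(dedupe(Seq("name", "age", "age")))
  }
}
```

with Spark, the result could presumably be applied by position, e.g. df.toDF(DedupeNames.dedupe(df.columns.toSeq): _*).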

DataFrame did check for duplicate column names until Sep 2014, but then the
check got pushed into the SQL planner making DataFrame standalone (so
without SQL) less useful as an API.

i filed a jira about this a while ago here:
https://issues.apache.org/jira/browse/SPARK-8817



On Thu, Oct 15, 2015 at 3:05 AM, Xiao Li <ga...@gmail.com> wrote:

> True. As long as we can ensure the correct messages are printed out, users
> can correct their apps easily. For example: Reference 'name' is ambiguous,
> could be: name#1, name#5.;
>
> Thanks,
>
> Xiao Li

Re: Should enforce the uniqueness of field name in DataFrame ?

Posted by Xiao Li <ga...@gmail.com>.
True. As long as we can ensure the correct messages are printed out, users
can correct their apps easily. For example: Reference 'name' is ambiguous,
could be: name#1, name#5.;

Thanks,

Xiao Li

2015-10-14 23:58 GMT-07:00 Reynold Xin <rx...@databricks.com>:

> That could break a lot of applications. In particular, a lot of input data
> sources (csv, json) don't have clean schemas and can have duplicate column
> names.
>
> For the case of join, maybe a better solution is to ask for a left/right
> prefix/suffix in the user code, similar to what Pandas does.

Re: Should enforce the uniqueness of field name in DataFrame ?

Posted by Reynold Xin <rx...@databricks.com>.
That could break a lot of applications. In particular, a lot of input data
sources (csv, json) don't have clean schemas and can have duplicate column
names.

For the case of join, maybe a better solution is to ask for a left/right
prefix/suffix in the user code, similar to what Pandas does.
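
A sketch of what such suffixing might compute, in plain Scala on column-name lists. It assumes the output order from the example in the original post (join keys once, then left columns, then right columns); the function name and the default suffixes _left/_right are hypothetical, not an existing Spark API.

```scala
object JoinSuffixes {
  /** Output column names for a join on `keys`: keys appear once, followed by
    * the remaining left and right columns. Non-key names present on both
    * sides receive the given suffixes. */
  def joinedNames(
      left: Seq[String],
      right: Seq[String],
      keys: Seq[String],
      lsuffix: String = "_left",
      rsuffix: String = "_right"): Seq[String] = {
    val leftRest = left.filterNot(c => keys.contains(c))
    val rightRest = right.filterNot(c => keys.contains(c))
    val overlap = leftRest.toSet.intersect(rightRest.toSet)
    keys ++
      leftRest.map(n => if (overlap(n)) n + lsuffix else n) ++
      rightRest.map(n => if (overlap(n)) n + rsuffix else n)
  }

  def main(args: Array[String]): Unit = {
    // The self-join from the original post: (name, age) joined with itself on "name".
    println(joinedNames(Seq("name", "age"), Seq("name", "age"), Seq("name")))
  }
}
```

Only colliding names are touched, so joins between DataFrames with disjoint columns would keep their names unchanged.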
