You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@spark.apache.org by Eric Friedman <er...@gmail.com> on 2015/03/21 23:02:32 UTC

join two DataFrames, same column name

I have a couple of data frames that I pulled from SparkSQL and the primary
key of one is a foreign key of the same name in the other.  I'd rather not
have to specify each column in the SELECT statement just so that I can
rename this single column.

When I try to join the data frames, I get an exception because it finds the
two columns of the same name to be ambiguous.  Is there a way to specify
which side of the join comes from data frame A and which comes from B?

var df1 = sqlContext.sql("select * from table1")
var df2 = sqlContext.sql("select * from table2)

df1.join(df2, df1("column_id") === df2("column_id"))

Re: join two DataFrames, same column name

Posted by Eric Friedman <er...@gmail.com>.
Michael, thank you for the workaround and for letting me know of the
upcoming enhancements, both of which sound appealing.

On Sun, Mar 22, 2015 at 1:25 PM, Michael Armbrust <mi...@databricks.com>
wrote:

> You can include * and a column alias in the same select clause
> var df1 = sqlContext.sql("select *, column_id AS table1_id from table1")
>
>
> I'm also hoping to resolve SPARK-6376
> <https://issues.apache.org/jira/browse/SPARK-6376> before Spark 1.3.1
> which will let you do something like:
> var df1 = sqlContext.sql("select * from table1").as("t1")
> var df2 = sqlContext.sql("select * from table2).as("t2")
> df1.join(df2, df1("column_id") === df2("column_id")).select("t1.column_id")
>
> Finally, there is SPARK-6380
> <https://issues.apache.org/jira/browse/SPARK-6380> that hopes to simplify
> this particular case.
>
> Michael
>
> On Sat, Mar 21, 2015 at 3:02 PM, Eric Friedman <er...@gmail.com>
> wrote:
>
>> I have a couple of data frames that I pulled from SparkSQL and the
>> primary key of one is a foreign key of the same name in the other.  I'd
>> rather not have to specify each column in the SELECT statement just so that
>> I can rename this single column.
>>
>> When I try to join the data frames, I get an exception because it finds
>> the two columns of the same name to be ambiguous.  Is there a way to
>> specify which side of the join comes from data frame A and which comes from
>> B?
>>
>> var df1 = sqlContext.sql("select * from table1")
>> var df2 = sqlContext.sql("select * from table2)
>>
>> df1.join(df2, df1("column_id") === df2("column_id"))
>>
>
>

Re: join two DataFrames, same column name

Posted by Eric Friedman <er...@gmail.com>.
>
> You can include * and a column alias in the same select clause
> var df1 = sqlContext.sql("select *, column_id AS table1_id from table1")


FYI, this does not ultimately work as the * still includes column_id and
you cannot have two columns of that name in the joined DataFrame.  So I
ended up aliasing both sides of the join.

On Sun, Mar 22, 2015 at 1:25 PM, Michael Armbrust <mi...@databricks.com>
wrote:

> You can include * and a column alias in the same select clause
> var df1 = sqlContext.sql("select *, column_id AS table1_id from table1")
>
>
> I'm also hoping to resolve SPARK-6376
> <https://issues.apache.org/jira/browse/SPARK-6376> before Spark 1.3.1
> which will let you do something like:
> var df1 = sqlContext.sql("select * from table1").as("t1")
> var df2 = sqlContext.sql("select * from table2).as("t2")
> df1.join(df2, df1("column_id") === df2("column_id")).select("t1.column_id")
>
> Finally, there is SPARK-6380
> <https://issues.apache.org/jira/browse/SPARK-6380> that hopes to simplify
> this particular case.
>
> Michael
>
> On Sat, Mar 21, 2015 at 3:02 PM, Eric Friedman <er...@gmail.com>
> wrote:
>
>> I have a couple of data frames that I pulled from SparkSQL and the
>> primary key of one is a foreign key of the same name in the other.  I'd
>> rather not have to specify each column in the SELECT statement just so that
>> I can rename this single column.
>>
>> When I try to join the data frames, I get an exception because it finds
>> the two columns of the same name to be ambiguous.  Is there a way to
>> specify which side of the join comes from data frame A and which comes from
>> B?
>>
>> var df1 = sqlContext.sql("select * from table1")
>> var df2 = sqlContext.sql("select * from table2)
>>
>> df1.join(df2, df1("column_id") === df2("column_id"))
>>
>
>

Re: join two DataFrames, same column name

Posted by Michael Armbrust <mi...@databricks.com>.
You can include * and a column alias in the same select clause
var df1 = sqlContext.sql("select *, column_id AS table1_id from table1")


I'm also hoping to resolve SPARK-6376
<https://issues.apache.org/jira/browse/SPARK-6376> before Spark 1.3.1 which
will let you do something like:
var df1 = sqlContext.sql("select * from table1").as("t1")
var df2 = sqlContext.sql("select * from table2).as("t2")
df1.join(df2, df1("column_id") === df2("column_id")).select("t1.column_id")

Finally, there is SPARK-6380
<https://issues.apache.org/jira/browse/SPARK-6380> that hopes to simplify
this particular case.

Michael

On Sat, Mar 21, 2015 at 3:02 PM, Eric Friedman <er...@gmail.com>
wrote:

> I have a couple of data frames that I pulled from SparkSQL and the primary
> key of one is a foreign key of the same name in the other.  I'd rather not
> have to specify each column in the SELECT statement just so that I can
> rename this single column.
>
> When I try to join the data frames, I get an exception because it finds
> the two columns of the same name to be ambiguous.  Is there a way to
> specify which side of the join comes from data frame A and which comes from
> B?
>
> var df1 = sqlContext.sql("select * from table1")
> var df2 = sqlContext.sql("select * from table2)
>
> df1.join(df2, df1("column_id") === df2("column_id"))
>