Posted to issues@spark.apache.org by "Sun Rui (JIRA)" <ji...@apache.org> on 2015/10/22 08:04:27 UTC

[jira] [Comment Edited] (SPARK-11231) join returns schema with duplicated and ambiguous join columns

    [ https://issues.apache.org/jira/browse/SPARK-11231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14968608#comment-14968608 ] 

Sun Rui edited comment on SPARK-11231 at 10/22/15 6:03 AM:
-----------------------------------------------------------

[~shivaram] This feature can be done on the R side even though it is not supported on the Scala side. But if the Scala side can support this, the R implementation can be simplified.


was (Author: sunrui):
[~shivaram] This feature can be done on R side even though it is supported on Scala side. But if Scala side can support this, R implementation can be simplified.

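As a rough sketch of that R-side approach, assuming the SparkR 1.5 API and the sdf1/sdf2 objects from the example in the issue below, the duplicate key could be avoided by renaming the right-hand key before the join and dropping the temporary column afterwards (the "key_rhs" name is made up for illustration):

{code}
# Rename the right-hand key so the joined schema has no duplicate name,
# then drop the temporary column after the join.
sdf2r  <- withColumnRenamed(sdf2, "key", "key_rhs")   # "key_rhs" is a made-up temporary name
joined <- join(sdf1, sdf2r, sdf1$key == sdf2r$key_rhs, "inner")
result <- select(joined, setdiff(columns(joined), "key_rhs"))
schema(result)   # expect a single "key" column plus value1 and value2
{code}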
> join returns schema with duplicated and ambiguous join columns
> --------------------------------------------------------------
>
>                 Key: SPARK-11231
>                 URL: https://issues.apache.org/jira/browse/SPARK-11231
>             Project: Spark
>          Issue Type: Bug
>          Components: SparkR
>    Affects Versions: 1.5.1
>         Environment: R
>            Reporter: Matt Pollock
>
> In the case where the key columns of two data frames have the same name, join returns a data frame in which that column is duplicated. Since the content of the columns is guaranteed to be the same row by row, consolidating the identical columns into a single column would replicate standard R behavior [1] and help prevent ambiguous names.
> Example:
> {code}
> > df1 <- data.frame(key=c("A", "B", "C"), value1=c(1, 2, 3))
> > df2 <- data.frame(key=c("A", "B", "C"), value2=c(4, 5, 6))
> > sdf1 <- createDataFrame(sqlContext, df1)
> > sdf2 <- createDataFrame(sqlContext, df2)
> > sjdf <- join(sdf1, sdf2, sdf1$key == sdf2$key, "inner")
> > schema(sjdf)
> StructType
> |-name = "key", type = "StringType", nullable = TRUE
> |-name = "value1", type = "DoubleType", nullable = TRUE
> |-name = "key", type = "StringType", nullable = TRUE
> |-name = "value2", type = "DoubleType", nullable = TRUE
> {code}
> The duplicated key columns cause things like:
> {code}
> > library(magrittr)
> > sjdf %>% select("key")
> 15/10/21 11:04:28 ERROR r.RBackendHandler: select on 1414 failed
> Error in invokeJava(isStatic = FALSE, objId$id, methodName, ...) : 
>   org.apache.spark.sql.AnalysisException: Reference 'key' is ambiguous, could be: key#125, key#127.;
> 	at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:278)
> 	at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveChildren(LogicalPlan.scala:162)
> 	at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$8$$anonfun$applyOrElse$4$$anonfun$20.apply(Analyzer.scala:403)
> 	at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$8$$anonfun$applyOrElse$4$$anonfun$20.apply(Analyzer.scala:403)
> 	at org.apache.spark.sql.catalyst.analysis.package$.withPosition(package.scala:48)
> 	at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$8$$anonfun$applyOrElse$4.applyOrElse(Analyzer.scala:403)
> 	at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$8$$anonfun$applyOrElse$4.applyOrElse(Analyzer.scala:399)
> 	at org.apache.spark.sql.catalyst.tree
> {code}
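> A possible workaround in the meantime, assuming the sdf1 and sjdf objects above, is to select by column reference rather than by name, e.g.:
> {code}
> > # Sketch: column references carry their originating data frame, so resolution
> > # is unambiguous even though both columns are named "key".
> > sjdf %>% select(sdf1$key, sjdf$value1, sjdf$value2)
> {code}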
> [1] In base R there is no "join", but a similar function, "merge", is provided, in which a "by" argument identifies the shared key column in the two data frames. In the case where the key column names differ, "by.x" and "by.y" arguments can be used. In the case of same-named key columns, the consolidation behavior requested above is observed. In the case of differing names, the "by.x" name is retained and consolidated with the "by.y" column, which is dropped.
> {code}
> > df1 <- data.frame(key=c("A", "B", "C"), value1=c(1, 2, 3))
> > df2 <- data.frame(key=c("A", "B", "C"), value2=c(4, 5, 6))
> > merge(df1, df2, by="key")
>   key value1 value2
> 1   A      1      4
> 2   B      2      5
> 3   C      3      6
> > df3 <- data.frame(akey=c("A", "B", "C"), value1=c(1, 2, 3))
> > merge(df2, df3, by.x="key", by.y="akey")
>   key value2 value1
> 1   A      4      1
> 2   B      5      2
> 3   C      6      3
> > merge(df3, df2, by.x="akey", by.y="key")
>   akey value1 value2
> 1    A      1      4
> 2    B      2      5
> 3    C      3      6
> {code}
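> A SparkR analogue of this consolidation might look like the sketch below; the by="key" consolidation shown is the behavior requested here, not what SparkR 1.5.1 currently does:
> {code}
> > # Hypothetical usage mirroring base R merge(); illustration of the request only.
> > sjdf <- merge(sdf1, sdf2, by = "key")
> > schema(sjdf)
> > # Desired: one "key" StringType column, plus value1 and value2 DoubleType columns
> {code}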



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org