Posted to issues@spark.apache.org by "Sun Rui (JIRA)" <ji...@apache.org> on 2015/10/08 10:48:26 UTC

[jira] [Commented] (SPARK-10981) R semijoin leads to Java errors, R leftsemi leads to Spark errors

    [ https://issues.apache.org/jira/browse/SPARK-10981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14948299#comment-14948299 ] 

Sun Rui commented on SPARK-10981:
---------------------------------

Yes, this is a bug in SparkR. Your fix looks good. Could you submit a PR for this?

In the PR, please:
1. Support all join types defined in sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/joinTypes.scala (you can remove the "_" character from the join types SparkR currently supports)

2. Add test cases for missing join types including "leftsemi"
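
For illustration only, here is a minimal sketch of how the check in item 1 might look. The helper name normalizeJoinType and the vector supportedJoinTypes are hypothetical and not part of SparkR; the list of names is copied from the Java error message quoted in the issue below, and lowercasing the input is an assumption. The sketch runs in plain R without a Spark context.

```r
# Hypothetical sketch, not the actual SparkR code.
# Join type names as listed in the Java IllegalArgumentException message.
supportedJoinTypes <- c("inner", "outer", "full", "fullouter",
                        "leftouter", "left", "rightouter", "right",
                        "leftsemi")

normalizeJoinType <- function(joinType) {
  # Drop underscores so that "left_outer" is treated as "leftouter".
  # Lowercasing is an assumption made for this sketch.
  normalized <- tolower(gsub("_", "", joinType, fixed = TRUE))
  if (!(normalized %in% supportedJoinTypes)) {
    stop("joinType must be one of the following types: ",
         paste0("'", supportedJoinTypes, "'", collapse = ", "))
  }
  normalized
}

# Usage: accepted spellings are normalized, unknown ones raise an error.
normalizeJoinType("left_outer")
normalizeJoinType("leftsemi")
tryCatch(normalizeJoinType("semijoin"),
         error = function(e) message("rejected: ", conditionMessage(e)))
```

A test for item 2 could then call join with each of these names, "leftsemi" included, and check that no error is raised.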

> R semijoin leads to Java errors, R leftsemi leads to Spark errors
> -----------------------------------------------------------------
>
>                 Key: SPARK-10981
>                 URL: https://issues.apache.org/jira/browse/SPARK-10981
>             Project: Spark
>          Issue Type: Bug
>          Components: R
>    Affects Versions: 1.5.0
>         Environment: SparkR from RStudio on Macbook
>            Reporter: Monica Liu
>            Priority: Minor
>              Labels: easyfix, newbie
>
> I am using SparkR from RStudio, and I ran into an error with the join function that I recreated with a smaller example:
> {code:title=joinTest.R|borderStyle=solid}
> Sys.setenv(SPARK_HOME="/Users/liumo1/Applications/spark/")
> .libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths()))
> library(SparkR)
> sc <- sparkR.init("local[4]")
> sqlContext <- sparkRSQL.init(sc) 
> n = c(2, 3, 5)
> s = c("aa", "bb", "cc")
> b = c(TRUE, FALSE, TRUE)
> df = data.frame(n, s, b)
> df1= createDataFrame(sqlContext, df)
> showDF(df1)
> x = c(2, 3, 10)
> t = c("dd", "ee", "ff")
> c = c(FALSE, FALSE, TRUE)
> dff = data.frame(x, t, c)
> df2 = createDataFrame(sqlContext, dff)
> showDF(df2)
> res = join(df1, df2, df1$n == df2$x, "semijoin")
> showDF(res)
> {code}
> Running this code, I encountered the error:
> {panel}
> Error in invokeJava(isStatic = FALSE, objId$id, methodName, ...) : 
>   java.lang.IllegalArgumentException: Unsupported join type 'semijoin'. Supported join types include: 'inner', 'outer', 'full', 'fullouter', 'leftouter', 'left', 'rightouter', 'right', 'leftsemi'.
> {panel}
> However, if I changed the joinType to "leftsemi", 
> {code}
> res = join(df1, df2, df1$n == df2$x, "leftsemi")
> {code}
> I would get the error:
> {panel}
> Error in .local(x, y, ...) : 
>   joinType must be one of the following types: 'inner', 'outer', 'left_outer', 'right_outer', 'semijoin'
> {panel}
> Since the join function in R appears to invoke a Java method, I edited DataFrame.R and changed "semijoin" to "leftsemi" on lines 1374 and 1378 to match the Java function's parameters. This also makes the joinType values accepted by R match those accepted by Scala.
> Original code (semijoin):
> {code:title=DataFrame.R: join(x, y, joinExpr, joinType)|borderStyle=solid}
> if (joinType %in% c("inner", "outer", "left_outer", "right_outer", "semijoin")) {
>   sdf <- callJMethod(x@sdf, "join", y@sdf, joinExpr@jc, joinType)
> } else {
>   stop("joinType must be one of the following types: ",
>        "'inner', 'outer', 'left_outer', 'right_outer', 'semijoin'")
> }
> {code}
> Modified code (leftsemi):
> {code:title=DataFrame.R: join(x, y, joinExpr, joinType)|borderStyle=solid}
> if (joinType %in% c("inner", "outer", "left_outer", "right_outer", "leftsemi")) {
>   sdf <- callJMethod(x@sdf, "join", y@sdf, joinExpr@jc, joinType)
> } else {
>   stop("joinType must be one of the following types: ",
>        "'inner', 'outer', 'left_outer', 'right_outer', 'leftsemi'")
> }
> {code}
> This fixed the issue, but I'm not sure whether this solution breaks Hive compatibility or causes other issues. I can submit a pull request for this change.


