Posted to commits@spark.apache.org by gu...@apache.org on 2019/03/16 04:05:17 UTC
[spark] branch master updated: [SPARK-27096][SQL][FOLLOWUP] Do the
correct validation of join types in R side and fix join docs for scala,
python and r
This is an automated email from the ASF dual-hosted git repository.
gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git
The following commit(s) were added to refs/heads/master by this push:
new 7a136f8 [SPARK-27096][SQL][FOLLOWUP] Do the correct validation of join types in R side and fix join docs for scala, python and r
7a136f8 is described below
commit 7a136f867049adbdb0c1c31de91ea8488a0c3a77
Author: Dilip Biswal <db...@us.ibm.com>
AuthorDate: Sat Mar 16 13:04:54 2019 +0900
[SPARK-27096][SQL][FOLLOWUP] Do the correct validation of join types in R side and fix join docs for scala, python and r
## What changes were proposed in this pull request?
This is a minor follow-up PR for SPARK-27096. The original PR reconciled the join types supported between the Dataset and SQL interfaces. In the case of R, join type validation is done on the R side. This PR does the correct validation and adds R tests covering all the join types along with the error condition. Along with this, I made the necessary doc corrections.
## How was this patch tested?
Add R tests.
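The R-side check accepts the documented aliases and strips underscores before handing the canonical string to the JVM (the `gsub("_", "", joinType)` call in DataFrame.R). A minimal Python sketch of that validation logic, purely illustrative and not Spark's actual code:

```python
# Illustrative sketch of the join-type validation this PR fixes on the R side:
# accept the documented aliases, reject anything else, strip underscores,
# and pass the canonical string on. Function names here are hypothetical.
VALID_JOIN_TYPES = {
    "inner", "cross",
    "outer", "full", "fullouter", "full_outer",
    "left", "leftouter", "left_outer",
    "right", "rightouter", "right_outer",
    "semi", "leftsemi", "left_semi",
    "anti", "leftanti", "left_anti",
}

def normalize_join_type(join_type: str) -> str:
    if join_type not in VALID_JOIN_TYPES:
        raise ValueError(
            "joinType must be one of the following types: "
            + ", ".join(sorted(VALID_JOIN_TYPES))
        )
    # Mirrors gsub("_", "", joinType) in DataFrame.R before calling the JVM.
    return join_type.replace("_", "")
```

With this shape, `"left_anti"`, `"leftanti"`, and `"anti"` all reach the JVM as strings Catalyst understands, while `"invalid"` fails fast with the full list of accepted types.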
Closes #24087 from dilipbiswal/joinfix_followup.
Authored-by: Dilip Biswal <db...@us.ibm.com>
Signed-off-by: Hyukjin Kwon <gu...@apache.org>
---
R/pkg/R/DataFrame.R | 15 +--
R/pkg/tests/fulltests/test_sparkSQL.R | 105 ++++++++++++++++-----
python/pyspark/sql/dataframe.py | 5 +-
.../main/scala/org/apache/spark/sql/Dataset.scala | 14 +--
4 files changed, 99 insertions(+), 40 deletions(-)
diff --git a/R/pkg/R/DataFrame.R b/R/pkg/R/DataFrame.R
index 9ad64a7..014ba28 100644
--- a/R/pkg/R/DataFrame.R
+++ b/R/pkg/R/DataFrame.R
@@ -2520,8 +2520,9 @@ setMethod("dropDuplicates",
#' Column expression. If joinExpr is omitted, the default, inner join is attempted and an error is
#' thrown if it would be a Cartesian Product. For Cartesian join, use crossJoin instead.
#' @param joinType The type of join to perform, default 'inner'.
-#' Must be one of: 'inner', 'cross', 'outer', 'full', 'full_outer',
-#' 'left', 'left_outer', 'right', 'right_outer', 'left_semi', or 'left_anti'.
+#' Must be one of: 'inner', 'cross', 'outer', 'full', 'fullouter', 'full_outer',
+#' 'left', 'leftouter', 'left_outer', 'right', 'rightouter', 'right_outer', 'semi',
+#' 'leftsemi', 'left_semi', 'anti', 'leftanti', 'left_anti'.
#' @return A SparkDataFrame containing the result of the join operation.
#' @family SparkDataFrame functions
#' @aliases join,SparkDataFrame,SparkDataFrame-method
@@ -2553,14 +2554,14 @@ setMethod("join",
"outer", "full", "fullouter", "full_outer",
"left", "leftouter", "left_outer",
"right", "rightouter", "right_outer",
- "left_semi", "leftsemi", "left_anti", "leftanti")) {
+ "semi", "left_semi", "leftsemi", "anti", "left_anti", "leftanti")) {
joinType <- gsub("_", "", joinType)
sdf <- callJMethod(x@sdf, "join", y@sdf, joinExpr@jc, joinType)
} else {
- stop("joinType must be one of the following types: ",
- "'inner', 'cross', 'outer', 'full', 'full_outer',",
- "'left', 'left_outer', 'right', 'right_outer',",
- "'left_semi', or 'left_anti'.")
+ stop(paste("joinType must be one of the following types:",
+ "'inner', 'cross', 'outer', 'full', 'fullouter', 'full_outer',",
+ "'left', 'leftouter', 'left_outer', 'right', 'rightouter', 'right_outer',",
+ "'semi', 'leftsemi', 'left_semi', 'anti', 'leftanti' or 'left_anti'."))
}
}
}
diff --git a/R/pkg/tests/fulltests/test_sparkSQL.R b/R/pkg/tests/fulltests/test_sparkSQL.R
index c60c951..c9d6134 100644
--- a/R/pkg/tests/fulltests/test_sparkSQL.R
+++ b/R/pkg/tests/fulltests/test_sparkSQL.R
@@ -2356,40 +2356,95 @@ test_that("join(), crossJoin() and merge() on a DataFrame", {
expect_equal(names(joined2), c("age", "name", "name", "test"))
expect_equal(count(joined2), 3)
- joined3 <- join(df, df2, df$name == df2$name, "rightouter")
+ joined3 <- join(df, df2, df$name == df2$name, "right")
expect_equal(names(joined3), c("age", "name", "name", "test"))
expect_equal(count(joined3), 4)
expect_true(is.na(collect(orderBy(joined3, joined3$age))$age[2]))
-
- joined4 <- select(join(df, df2, df$name == df2$name, "outer"),
- alias(df$age + 5, "newAge"), df$name, df2$test)
- expect_equal(names(joined4), c("newAge", "name", "test"))
+
+ joined4 <- join(df, df2, df$name == df2$name, "right_outer")
+ expect_equal(names(joined4), c("age", "name", "name", "test"))
expect_equal(count(joined4), 4)
- expect_equal(collect(orderBy(joined4, joined4$name))$newAge[3], 24)
+ expect_true(is.na(collect(orderBy(joined4, joined4$age))$age[2]))
- joined5 <- join(df, df2, df$name == df2$name, "leftouter")
+ joined5 <- join(df, df2, df$name == df2$name, "rightouter")
expect_equal(names(joined5), c("age", "name", "name", "test"))
- expect_equal(count(joined5), 3)
- expect_true(is.na(collect(orderBy(joined5, joined5$age))$age[1]))
-
- joined6 <- join(df, df2, df$name == df2$name, "inner")
- expect_equal(names(joined6), c("age", "name", "name", "test"))
- expect_equal(count(joined6), 3)
+ expect_equal(count(joined5), 4)
+ expect_true(is.na(collect(orderBy(joined5, joined5$age))$age[2]))
- joined7 <- join(df, df2, df$name == df2$name, "leftsemi")
- expect_equal(names(joined7), c("age", "name"))
- expect_equal(count(joined7), 3)
- joined8 <- join(df, df2, df$name == df2$name, "left_outer")
- expect_equal(names(joined8), c("age", "name", "name", "test"))
- expect_equal(count(joined8), 3)
- expect_true(is.na(collect(orderBy(joined8, joined8$age))$age[1]))
-
- joined9 <- join(df, df2, df$name == df2$name, "right_outer")
- expect_equal(names(joined9), c("age", "name", "name", "test"))
+ joined6 <- select(join(df, df2, df$name == df2$name, "outer"),
+ alias(df$age + 5, "newAge"), df$name, df2$test)
+ expect_equal(names(joined6), c("newAge", "name", "test"))
+ expect_equal(count(joined6), 4)
+ expect_equal(collect(orderBy(joined6, joined6$name))$newAge[3], 24)
+
+ joined7 <- select(join(df, df2, df$name == df2$name, "full"),
+ alias(df$age + 5, "newAge"), df$name, df2$test)
+ expect_equal(names(joined7), c("newAge", "name", "test"))
+ expect_equal(count(joined7), 4)
+ expect_equal(collect(orderBy(joined7, joined7$name))$newAge[3], 24)
+
+ joined8 <- select(join(df, df2, df$name == df2$name, "fullouter"),
+ alias(df$age + 5, "newAge"), df$name, df2$test)
+ expect_equal(names(joined8), c("newAge", "name", "test"))
+ expect_equal(count(joined8), 4)
+ expect_equal(collect(orderBy(joined8, joined8$name))$newAge[3], 24)
+
+ joined9 <- select(join(df, df2, df$name == df2$name, "full_outer"),
+ alias(df$age + 5, "newAge"), df$name, df2$test)
+ expect_equal(names(joined9), c("newAge", "name", "test"))
expect_equal(count(joined9), 4)
- expect_true(is.na(collect(orderBy(joined9, joined9$age))$age[2]))
-
+ expect_equal(collect(orderBy(joined9, joined9$name))$newAge[3], 24)
+
+ joined10 <- join(df, df2, df$name == df2$name, "left")
+ expect_equal(names(joined10), c("age", "name", "name", "test"))
+ expect_equal(count(joined10), 3)
+ expect_true(is.na(collect(orderBy(joined10, joined10$age))$age[1]))
+
+ joined11 <- join(df, df2, df$name == df2$name, "leftouter")
+ expect_equal(names(joined11), c("age", "name", "name", "test"))
+ expect_equal(count(joined11), 3)
+ expect_true(is.na(collect(orderBy(joined11, joined11$age))$age[1]))
+
+ joined12 <- join(df, df2, df$name == df2$name, "left_outer")
+ expect_equal(names(joined12), c("age", "name", "name", "test"))
+ expect_equal(count(joined12), 3)
+ expect_true(is.na(collect(orderBy(joined12, joined12$age))$age[1]))
+
+ joined13 <- join(df, df2, df$name == df2$name, "inner")
+ expect_equal(names(joined13), c("age", "name", "name", "test"))
+ expect_equal(count(joined13), 3)
+
+ joined14 <- join(df, df2, df$name == df2$name, "semi")
+ expect_equal(names(joined14), c("age", "name"))
+ expect_equal(count(joined14), 3)
+
+ joined14 <- join(df, df2, df$name == df2$name, "leftsemi")
+ expect_equal(names(joined14), c("age", "name"))
+ expect_equal(count(joined14), 3)
+
+ joined15 <- join(df, df2, df$name == df2$name, "left_semi")
+ expect_equal(names(joined15), c("age", "name"))
+ expect_equal(count(joined15), 3)
+
+ joined16 <- join(df2, df, df2$name == df$name, "anti")
+ expect_equal(names(joined16), c("name", "test"))
+ expect_equal(count(joined16), 1)
+
+ joined17 <- join(df2, df, df2$name == df$name, "leftanti")
+ expect_equal(names(joined17), c("name", "test"))
+ expect_equal(count(joined17), 1)
+
+ joined18 <- join(df2, df, df2$name == df$name, "left_anti")
+ expect_equal(names(joined18), c("name", "test"))
+ expect_equal(count(joined18), 1)
+
+ error_msg <- paste("joinType must be one of the following types:",
+ "'inner', 'cross', 'outer', 'full', 'fullouter', 'full_outer',",
+ "'left', 'leftouter', 'left_outer', 'right', 'rightouter', 'right_outer',",
+ "'semi', 'leftsemi', 'left_semi', 'anti', 'leftanti' or 'left_anti'.")
+ expect_error(join(df2, df, df2$name == df$name, "invalid"), error_msg)
+
merged <- merge(df, df2, by.x = "name", by.y = "name", all.x = TRUE, all.y = TRUE)
expect_equal(count(merged), 4)
expect_equal(names(merged), c("age", "name_x", "name_y", "test"))
diff --git a/python/pyspark/sql/dataframe.py b/python/pyspark/sql/dataframe.py
index 75dd9fb..8227e82 100644
--- a/python/pyspark/sql/dataframe.py
+++ b/python/pyspark/sql/dataframe.py
@@ -1000,8 +1000,9 @@ class DataFrame(object):
If `on` is a string or a list of strings indicating the name of the join column(s),
the column(s) must exist on both sides, and this performs an equi-join.
:param how: str, default ``inner``. Must be one of: ``inner``, ``cross``, ``outer``,
- ``full``, ``full_outer``, ``left``, ``left_outer``, ``right``, ``right_outer``,
- ``left_semi``, and ``left_anti``.
+ ``full``, ``fullouter``, ``full_outer``, ``left``, ``leftouter``, ``left_outer``,
+ ``right``, ``rightouter``, ``right_outer``, ``semi``, ``leftsemi``, ``left_semi``,
+ ``anti``, ``leftanti`` and ``left_anti``.
The following performs a full outer join between ``df1`` and ``df2``.
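The docstring above distinguishes joining on an expression from joining on a shared column name, which performs an equi-join. As a rough illustration of what an equi-join with a few of the canonical join types means, here is a toy Python version over lists of dicts; this is a sketch for intuition only and has nothing to do with Spark's implementation:

```python
def equi_join(left, right, on, how="inner"):
    # Toy equi-join of two lists of dicts on the shared key `on`.
    # Supports the canonical types 'inner', 'left_outer', 'full_outer'
    # to illustrate the docstring above. Not Spark code.
    assert how in {"inner", "left_outer", "full_outer"}
    rows, matched = [], set()
    right_cols = right[0].keys() if right else ()
    left_cols = left[0].keys() if left else ()
    for lrow in left:
        hits = [i for i, rrow in enumerate(right) if rrow[on] == lrow[on]]
        matched.update(hits)
        if hits:
            rows += [{**lrow, **right[i]} for i in hits]
        elif how in {"left_outer", "full_outer"}:
            # No match: pad the right-side columns with None.
            rows.append({**lrow, **{k: None for k in right_cols if k != on}})
    if how == "full_outer":
        for i, rrow in enumerate(right):
            if i not in matched:
                rows.append({**{k: None for k in left_cols if k != on}, **rrow})
    return rows
```

For example, with `left = [{"name": "a", "age": 1}, {"name": "b", "age": 2}]` and `right = [{"name": "b", "test": "x"}, {"name": "c", "test": "y"}]`, an inner join yields one row and a full outer join yields three, matching the row counts the R tests in this commit assert.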
diff --git a/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala b/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala
index c2c2ebc..2accb32 100644
--- a/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala
+++ b/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala
@@ -936,8 +936,9 @@ class Dataset[T] private[sql](
* @param right Right side of the join operation.
* @param usingColumns Names of the columns to join on. These columns must exist on both sides.
* @param joinType Type of join to perform. Default `inner`. Must be one of:
- * `inner`, `cross`, `outer`, `full`, `full_outer`, `left`, `left_outer`,
- * `right`, `right_outer`, `left_semi`, `semi`, `left_anti`, `anti`.
+ * `inner`, `cross`, `outer`, `full`, `fullouter`, `full_outer`, `left`,
+ * `leftouter`, `left_outer`, `right`, `rightouter`, `right_outer`,
+ * `semi`, `leftsemi`, `left_semi`, `anti`, `leftanti`, `left_anti`.
*
* @note If you perform a self-join using this function without aliasing the input
* `DataFrame`s, you will NOT be able to reference any columns after the join, since
@@ -994,8 +995,9 @@ class Dataset[T] private[sql](
* @param right Right side of the join.
* @param joinExprs Join expression.
* @param joinType Type of join to perform. Default `inner`. Must be one of:
- * `inner`, `cross`, `outer`, `full`, `full_outer`, `left`, `left_outer`,
- * `right`, `right_outer`, `left_semi`, `semi`, `left_anti`, `anti`.
+ * `inner`, `cross`, `outer`, `full`, `fullouter`, `full_outer`, `left`,
+ * `leftouter`, `left_outer`, `right`, `rightouter`, `right_outer`,
+ * `semi`, `leftsemi`, `left_semi`, `anti`, `leftanti`, `left_anti`.
*
* @group untypedrel
* @since 2.0.0
@@ -1078,8 +1080,8 @@ class Dataset[T] private[sql](
* @param other Right side of the join.
* @param condition Join expression.
* @param joinType Type of join to perform. Default `inner`. Must be one of:
- * `inner`, `cross`, `outer`, `full`, `full_outer`, `left`, `left_outer`,
- * `right`, `right_outer`.
+ * `inner`, `cross`, `outer`, `full`, `fullouter`, `full_outer`, `left`,
+ * `leftouter`, `left_outer`, `right`, `rightouter`, `right_outer`.
*
* @group typedrel
* @since 1.6.0
---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@spark.apache.org
For additional commands, e-mail: commits-help@spark.apache.org