Posted to commits@arrow.apache.org by pa...@apache.org on 2023/01/15 03:02:11 UTC

[arrow] branch master updated: GH-14981: [R] Forward compatibility with dplyr::join_by() (#33664)

This is an automated email from the ASF dual-hosted git repository.

paleolimbot pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/arrow.git


The following commit(s) were added to refs/heads/master by this push:
     new fbcaee1e3e GH-14981: [R] Forward compatibility with dplyr::join_by()  (#33664)
fbcaee1e3e is described below

commit fbcaee1e3e662ef79e0502006f5966faa7c93989
Author: Ian Cook <ia...@gmail.com>
AuthorDate: Sat Jan 14 22:02:03 2023 -0500

    GH-14981: [R] Forward compatibility with dplyr::join_by()  (#33664)
    
    # Which issue does this PR close?
    
    Closes #14981
    
    # Rationale for this change
    
    dplyr 1.1.0 introduces a new function, `join_by()`, for specifying join conditions. This PR adds support for `join_by()` in dplyr joins on Arrow objects. Support is limited to equality conditions: the code added in this PR throws an error if the user specifies inequality conditions or uses helper functions in `join_by()`.
    
    https://www.tidyverse.org/blog/2022/11/dplyr-1-1-0-is-coming-soon/#join-improvements
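    
    The equality-only restriction described above can be sketched in base R. This is an illustrative sketch, not the code in this PR: `mock_join_by()` and `join_by_to_named_vector()` are hypothetical names, and the mock only imitates the fields of a `join_by()` result that the patch reads (`x`, `y`, `condition`, `filter`), so that the example runs without dplyr 1.1.0 installed.
    
    ```r
    # Hypothetical stand-in for the object returned by dplyr::join_by();
    # only the fields read by the patch (x, y, condition, filter) are mocked.
    mock_join_by <- function(x, y, condition = "==", filter = "none") {
      structure(
        list(x = x, y = y, condition = condition, filter = filter),
        class = "dplyr_join_by"
      )
    }
    
    # Base-R sketch of the conversion: an equality-only spec becomes a named
    # character vector (names = left-hand columns, values = right-hand columns).
    join_by_to_named_vector <- function(by) {
      stopifnot(inherits(by, "dplyr_join_by"))
      if (!all(by$condition == "==" & by$filter == "none")) {
        stop("Inequality conditions and helper functions are not supported.")
      }
      stats::setNames(by$y, by$x)
    }
    
    by <- join_by_to_named_vector(mock_join_by("some_grouping", "the_grouping"))
    print(by)
    ```
    
    The named vector has the same shape as the `by = c("left_col" = "right_col")` argument that the older dplyr join syntax accepts, which is presumably why the conversion lets the rest of the join code stay unchanged.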
    
    # What changes are included in this PR?
    
    - Code to handle `join_by()` in dplyr joins on Arrow objects with equality conditions
    - Tests of `join_by()` handling. These are skipped when the installed version of dplyr is lower than `1.0.99.9000`, the current version number of the development version of dplyr on GitHub, which will become version `1.1.0` on CRAN.
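    
    The version gate used to skip the tests relies on R comparing package versions component by component rather than as strings. A small sketch of that comparison follows; the helper name `has_join_by()` and the sample version numbers are chosen for illustration and are not part of this PR.
    
    ```r
    # Hypothetical helper: TRUE when a dplyr version is new enough for join_by().
    # package_version() objects compare numerically, one component at a time.
    has_join_by <- function(version) {
      package_version(version) >= "1.0.99.9000"
    }
    
    old_release <- has_join_by("1.0.10")      # third component: 10 < 99
    dev_version <- has_join_by("1.0.99.9000") # equal to the threshold
    new_release <- has_join_by("1.1.0")       # second component: 1 > 0
    print(c(old_release, dev_version, new_release))
    ```
    
    In the tests themselves the same comparison is applied to `packageVersion("dplyr")`, so the checks run against both the development version and the eventual CRAN release.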
    
    
    # Are these changes tested?
    
    Yes
    
    # Are there any user-facing changes?
    
    Yes, the new dplyr syntax for specifying join conditions is supported, but use of this new syntax is optional. The old dplyr join syntax will continue to work. There are no breaking changes in this PR.
    * Closes: #14981
    
    Authored-by: Ian Cook <ia...@gmail.com>
    Signed-off-by: Dewey Dunnington <de...@fishandwhistle.net>
---
 r/R/dplyr-join.R                   | 11 +++++++++
 r/tests/testthat/test-dplyr-join.R | 50 ++++++++++++++++++++++++++++++++++++++
 2 files changed, 61 insertions(+)

diff --git a/r/R/dplyr-join.R b/r/R/dplyr-join.R
index fad44b5ef2..2ba3c307c1 100644
--- a/r/R/dplyr-join.R
+++ b/r/R/dplyr-join.R
@@ -136,6 +136,17 @@ handle_join_by <- function(by, x, y) {
   if (is.null(by)) {
     return(set_names(intersect(names(x), names(y))))
   }
+  if (inherits(by, "dplyr_join_by")) {
+    if (!all(by$condition == "==" & by$filter == "none")) {
+      abort(
+        paste0(
+          "Inequality conditions and helper functions ",
+          "are not supported in `join_by()` expressions."
+        )
+      )
+    }
+    by <- set_names(by$y, by$x)
+  }
   stopifnot(is.character(by))
   if (is.null(names(by))) {
     by <- set_names(by)
diff --git a/r/tests/testthat/test-dplyr-join.R b/r/tests/testthat/test-dplyr-join.R
index 5c6798aeeb..3470a886b3 100644
--- a/r/tests/testthat/test-dplyr-join.R
+++ b/r/tests/testthat/test-dplyr-join.R
@@ -67,6 +67,39 @@ test_that("left_join `by` args", {
   )
 })
 
+test_that("left_join with join_by", {
+  # only run this test in newer versions of dplyr that include `join_by()`
+  skip_if_not(packageVersion("dplyr") >= "1.0.99.9000")
+
+  compare_dplyr_binding(
+    .input %>%
+      left_join(to_join, join_by(some_grouping)) %>%
+      collect(),
+    left
+  )
+  compare_dplyr_binding(
+    .input %>%
+      left_join(
+        to_join %>%
+          rename(the_grouping = some_grouping),
+          join_by(some_grouping == the_grouping)
+      ) %>%
+      collect(),
+    left
+  )
+
+  compare_dplyr_binding(
+    .input %>%
+      rename(the_grouping = some_grouping) %>%
+      left_join(
+        to_join,
+        join_by(the_grouping == some_grouping)
+      ) %>%
+      collect(),
+    left
+  )
+})
+
 test_that("join two tables", {
   expect_identical(
     arrow_table(left) %>%
@@ -136,6 +169,23 @@ test_that("Error handling", {
   )
 })
 
+test_that("Error handling for unsupported expressions in join_by", {
+  # only run this test in newer versions of dplyr that include `join_by()`
+  skip_if_not(packageVersion("dplyr") >= "1.0.99.9000")
+
+  expect_error(
+    arrow_table(left) %>%
+      left_join(to_join, join_by(some_grouping >= some_grouping)),
+    "not supported"
+  )
+
+  expect_error(
+    arrow_table(left) %>%
+      left_join(to_join, join_by(closest(some_grouping >= some_grouping))),
+    "not supported"
+  )
+})
+
 # TODO: test duplicate col names
 # TODO: casting: int and float columns?