You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/01/18 19:54:38 UTC

[GitHub] [arrow] jonkeane commented on a change in pull request #12179: ARROW-14609 [R] left_join by argument error message mismatch

jonkeane commented on a change in pull request #12179:
URL: https://github.com/apache/arrow/pull/12179#discussion_r787102302



##########
File path: r/R/dplyr-join.R
##########
@@ -117,10 +117,33 @@ handle_join_by <- function(by, x, y) {
   if (is.null(names(by))) {
     by <- set_names(by)
   }
-  # TODO: nicer messages?
-  stopifnot(
-    all(names(by) %in% names(x)),
-    all(by %in% names(y))
-  )
+
+  missing_x_cols <- setdiff(names(by), names(x))
+  if (length(missing_x_cols) > 0) {
+    message <- paste(
+      "Join",
+      ngettext(length(missing_x_cols), "column", "columns"),
+      "must be present in data."
+    )
+    message_x <- paste(
+      oxford_paste(missing_x_cols, quote_symbol = "`"),
+      "not present in x."
+      )
+    abort(c(message, x = message_x))
+  }
+
+  missing_y_cols <- setdiff(by, names(y))
+  if (length(missing_y_cols) > 0) {
+    message <- paste(
+      "Join",
+      ngettext(length(missing_y_cols), "column", "columns"),
+      "must be present in data."
+    )
+    message_y <- paste(
+      oxford_paste(missing_y_cols, quote_symbol = "`"),
+      "not present in y."
+    )
+    abort(c(message, x = message_y))
+  }

Review comment:
       Would it be possible to accumulate message_x and message_y in parallel and then print them both? This might ultimately simplify this (e.g. the "join column must be present in data" would happen once). But more importantly it would give people the feedback about both x and y at the beginning instead of fixing one and then having another pop up.
   
   

##########
File path: r/tests/testthat/test-dplyr-join.R
##########
@@ -90,9 +90,57 @@ test_that("Error handling", {
     left_tab %>%
       left_join(to_join, by = "not_a_col") %>%
       collect(),
-    "all(names(by) %in% names(x)) is not TRUE",
-    fixed = TRUE
+    "Join column must be present in data"
   )
+  expect_snapshot({
+    (expect_error(

Review comment:
       The extra wrapping of `()`s is to get the error from a whole dplyr chain, right? (at least, that's what I think https://github.com/r-lib/testthat/issues/1471 is saying). 
   
   In these cases, there's not actually a chain, so these should just work (right?)

##########
File path: r/tests/testthat/_snaps/dplyr-join.md
##########
@@ -0,0 +1,50 @@
+# Error handling
+
+    Code
+      (expect_error(left_join(arrow_table(example_data), arrow_table(example_data),
+      by = "made_up_colname")))
+    Output
+      <error/rlang_error>
+      Join column must be present in data.
+      x `made_up_colname` not present in x.

Review comment:
       The error messages including x/y is a nice touch, though it should be pointed out that does differ from dplyr:
   
   ```
   library(dplyr)
   #> 
   #> Attaching package: 'dplyr'
   #> The following objects are masked from 'package:stats':
   #> 
   #>     filter, lag
   #> The following objects are masked from 'package:base':
   #> 
   #>     intersect, setdiff, setequal, union
   left_join(mtcars, mtcars, by = "foo")
   #> Error: Join columns must be present in data.
   #> ✖ Problem with `foo`.
   ```
   
   Giving the explicit x/y is friendlier though, so I'm happy to keep it.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org