You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/04/06 20:44:49 UTC

[GitHub] [arrow] boshek opened a new pull request, #12818: ARROW-16038: [R] different behavior from dplyr when mutate's `.keep` option is set

boshek opened a new pull request, #12818:
URL: https://github.com/apache/arrow/pull/12818

   This PR does two things to match some dplyr behaviour around column order:
   
   1) Mimics dplyr implementation of `mutate(..., .keep = "none")` to append new columns after the existing columns (if suggested) as [per](https://github.com/tidyverse/dplyr/issues/6086)
   
   2) As per this [discussion](https://github.com/tidyverse/dplyr/issues/6086), this required a bespoke approach to `transmute` as it not simply a wrapper for `mutate(..., .keep = "none")`. This cascades into needing to catch a couple edge cases.
   
   I have also added some tests which will test for this behaviour. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] boshek commented on a diff in pull request #12818: ARROW-16038: [R] different behavior from dplyr when mutate's `.keep` option is set

Posted by GitBox <gi...@apache.org>.
boshek commented on code in PR #12818:
URL: https://github.com/apache/arrow/pull/12818#discussion_r844468847


##########
r/tests/testthat/test-dplyr-mutate.R:
##########
@@ -74,6 +74,16 @@ test_that("transmute", {
   )
 })
 
+test_that("transmute respect bespoke dplyr implementation", {
+  ## see: https://github.com/tidyverse/dplyr/issues/6086
+  compare_dplyr_binding(
+    .input %>%
+      transmute(dbl, int = int + 6L) %>%
+      collect(),
+    tbl
+  )
+})

Review Comment:
   As in an unnamed argument like this?
   ```r
   df |> 
     transmute("foo")
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] boshek commented on a diff in pull request #12818: ARROW-16038: [R] different behavior from dplyr when mutate's `.keep` option is set

Posted by GitBox <gi...@apache.org>.
boshek commented on code in PR #12818:
URL: https://github.com/apache/arrow/pull/12818#discussion_r844468847


##########
r/tests/testthat/test-dplyr-mutate.R:
##########
@@ -74,6 +74,16 @@ test_that("transmute", {
   )
 })
 
+test_that("transmute respect bespoke dplyr implementation", {
+  ## see: https://github.com/tidyverse/dplyr/issues/6086
+  compare_dplyr_binding(
+    .input %>%
+      transmute(dbl, int = int + 6L) %>%
+      collect(),
+    tbl
+  )
+})

Review Comment:
   As in an unnamed argument like this?
   ```r
   df |> 
     transmute("foo")
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] ursabot commented on pull request #12818: ARROW-16038: [R] different behavior from dplyr when mutate's `.keep` option is set

Posted by GitBox <gi...@apache.org>.
ursabot commented on PR #12818:
URL: https://github.com/apache/arrow/pull/12818#issuecomment-1093535833

   Benchmark runs are scheduled for baseline = f0b5c49a0d60b5240b584ea27773a210aa7cead2 and contender = b8299436c8b1a2d7cd3d6e019a2b750893a3af87. b8299436c8b1a2d7cd3d6e019a2b750893a3af87 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
   Conbench compare runs links:
   [Finished :arrow_down:0.0% :arrow_up:0.0%] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/af754f90443849adb39662dc890f3752...b474b1dbf70442f699aca1e18377bff8/)
   [Finished :arrow_down:0.17% :arrow_up:0.0%] [test-mac-arm](https://conbench.ursa.dev/compare/runs/607f028c2bf748a98e6bf822ac86eb84...9799711906404af683e2d08acb51b7ec/)
   [Finished :arrow_down:0.71% :arrow_up:0.0%] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/2805c3772ff64fe48bcd3ea35a8bd022...525f10b2021c418793df71b4f47e9eca/)
   [Finished :arrow_down:0.17% :arrow_up:0.0%] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/b6ddc2217def48bdad6e0d85194a3ecf...bfb863f16a65415bbd628f1145fcadf6/)
   Buildkite builds:
   [Finished] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/473| `b8299436` ec2-t3-xlarge-us-east-2>
   [Finished] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/458| `b8299436` test-mac-arm>
   [Finished] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-i9-9960x/builds/459| `b8299436` ursa-i9-9960x>
   [Finished] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/468| `b8299436` ursa-thinkcentre-m75q>
   [Finished] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/472| `f0b5c49a` ec2-t3-xlarge-us-east-2>
   [Finished] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/457| `f0b5c49a` test-mac-arm>
   [Finished] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-i9-9960x/builds/458| `f0b5c49a` ursa-i9-9960x>
   [Finished] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/467| `f0b5c49a` ursa-thinkcentre-m75q>
   Supported benchmarks:
   ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
   test-mac-arm: Supported benchmark langs: C++, Python, R
   ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
   ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] boshek commented on a diff in pull request #12818: ARROW-16038: [R] different behavior from dplyr when mutate's `.keep` option is set

Posted by GitBox <gi...@apache.org>.
boshek commented on code in PR #12818:
URL: https://github.com/apache/arrow/pull/12818#discussion_r844466749


##########
r/R/dplyr-mutate.R:
##########
@@ -112,7 +113,15 @@ mutate.Dataset <- mutate.ArrowTabular <- mutate.RecordBatchReader <- mutate.arro
 
 transmute.arrow_dplyr_query <- function(.data, ...) {
   dots <- check_transmute_args(...)
-  dplyr::mutate(.data, !!!dots, .keep = "none")
+  has_null <- vapply(dots, quo_is_null, logical(1))

Review Comment:
   good to know. i wondered about this



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] boshek commented on a diff in pull request #12818: ARROW-16038: [R] different behavior from dplyr when mutate's `.keep` option is set

Posted by GitBox <gi...@apache.org>.
boshek commented on code in PR #12818:
URL: https://github.com/apache/arrow/pull/12818#discussion_r845388673


##########
r/R/dplyr-mutate.R:
##########
@@ -112,7 +113,15 @@ mutate.Dataset <- mutate.ArrowTabular <- mutate.RecordBatchReader <- mutate.arro
 
 transmute.arrow_dplyr_query <- function(.data, ...) {
   dots <- check_transmute_args(...)
-  dplyr::mutate(.data, !!!dots, .keep = "none")
+  has_null <- vapply(dots, quo_is_null, logical(1))
+  .data <- dplyr::mutate(.data, !!!dots, .keep = "none")
+  if (is_empty(dots) | any(has_null)) return(.data)
+
+  ## keeping with: https://github.com/tidyverse/dplyr/issues/6086
+  cur_exprs <- vapply(dots, as_label, character(1))
+  new_col_names <- names(dots)
+  transmute_order <- ifelse(nzchar(new_col_names), new_col_names, cur_exprs)

Review Comment:
   Done



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] boshek commented on a diff in pull request #12818: ARROW-16038: [R] different behavior from dplyr when mutate's `.keep` option is set

Posted by GitBox <gi...@apache.org>.
boshek commented on code in PR #12818:
URL: https://github.com/apache/arrow/pull/12818#discussion_r844464046


##########
r/R/dplyr-mutate.R:
##########
@@ -112,7 +113,15 @@ mutate.Dataset <- mutate.ArrowTabular <- mutate.RecordBatchReader <- mutate.arro
 
 transmute.arrow_dplyr_query <- function(.data, ...) {
   dots <- check_transmute_args(...)
-  dplyr::mutate(.data, !!!dots, .keep = "none")
+  has_null <- vapply(dots, quo_is_null, logical(1))
+  .data <- dplyr::mutate(.data, !!!dots, .keep = "none")
+  if (is_empty(dots) | any(has_null)) return(.data)
+
+  ## keeping with: https://github.com/tidyverse/dplyr/issues/6086
+  cur_exprs <- vapply(dots, as_label, character(1))
+  new_col_names <- names(dots)
+  transmute_order <- ifelse(nzchar(new_col_names), new_col_names, cur_exprs)

Review Comment:
   Very open to other approaches but the dplyr implementation of `transmute` return the columns in the order they are given in the function call which is now distinct from `mutate(..., .keep = "none")`.  So this was what i came up with to preserve that order. But there possibly could be a better way to capture order? 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] jonkeane commented on a diff in pull request #12818: ARROW-16038: [R] different behavior from dplyr when mutate's `.keep` option is set

Posted by GitBox <gi...@apache.org>.
jonkeane commented on code in PR #12818:
URL: https://github.com/apache/arrow/pull/12818#discussion_r845378529


##########
r/R/dplyr-mutate.R:
##########
@@ -112,7 +113,15 @@ mutate.Dataset <- mutate.ArrowTabular <- mutate.RecordBatchReader <- mutate.arro
 
 transmute.arrow_dplyr_query <- function(.data, ...) {
   dots <- check_transmute_args(...)
-  dplyr::mutate(.data, !!!dots, .keep = "none")
+  has_null <- vapply(dots, quo_is_null, logical(1))
+  .data <- dplyr::mutate(.data, !!!dots, .keep = "none")
+  if (is_empty(dots) | any(has_null)) return(.data)
+
+  ## keeping with: https://github.com/tidyverse/dplyr/issues/6086
+  cur_exprs <- vapply(dots, as_label, character(1))
+  new_col_names <- names(dots)
+  transmute_order <- ifelse(nzchar(new_col_names), new_col_names, cur_exprs)

Review Comment:
   `dplyr::coalesce` works here (and since this is a dplyr function, we can be pretty sure dplyr is already installed), but generally we try to shy away from using dplyr as a dependency. 
   
   So I would say let's get the names from `cur_exprs` but then do the more base `transmute_order[is.na(names(...))] <- ...` assignment style. IMO it also is slightly more explicit about what we're doing (replacing anything that doesn't have a name with a name-like representation of itself)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] github-actions[bot] commented on pull request #12818: ARROW-16038: [R] different behavior from dplyr when mutate's `.keep` option is set

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on PR #12818:
URL: https://github.com/apache/arrow/pull/12818#issuecomment-1090776177

   https://issues.apache.org/jira/browse/ARROW-16038


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] jonkeane closed pull request #12818: ARROW-16038: [R] different behavior from dplyr when mutate's `.keep` option is set

Posted by GitBox <gi...@apache.org>.
jonkeane closed pull request #12818: ARROW-16038: [R] different behavior from dplyr when mutate's `.keep` option is set
URL: https://github.com/apache/arrow/pull/12818


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] boshek commented on a diff in pull request #12818: ARROW-16038: [R] different behavior from dplyr when mutate's `.keep` option is set

Posted by GitBox <gi...@apache.org>.
boshek commented on code in PR #12818:
URL: https://github.com/apache/arrow/pull/12818#discussion_r845302595


##########
r/R/dplyr-mutate.R:
##########
@@ -112,7 +113,15 @@ mutate.Dataset <- mutate.ArrowTabular <- mutate.RecordBatchReader <- mutate.arro
 
 transmute.arrow_dplyr_query <- function(.data, ...) {
   dots <- check_transmute_args(...)
-  dplyr::mutate(.data, !!!dots, .keep = "none")
+  has_null <- vapply(dots, quo_is_null, logical(1))
+  .data <- dplyr::mutate(.data, !!!dots, .keep = "none")
+  if (is_empty(dots) | any(has_null)) return(.data)
+
+  ## keeping with: https://github.com/tidyverse/dplyr/issues/6086
+  cur_exprs <- vapply(dots, as_label, character(1))
+  new_col_names <- names(dots)
+  transmute_order <- ifelse(nzchar(new_col_names), new_col_names, cur_exprs)

Review Comment:
   > maybe the intention is to replace any thing in dots that doesn't have a name with what's in cur_expers
   
   Yes you are exactly right. I just realized that because `map_chr` returns a named vector I can equally get the names from `names(cur_exprs)`. Also `coalesce` might work here too. What about this?
   
   ```r
     cur_exprs <- map_chr(dots, as_label)
     transmute_order <- dplyr::coalesce(na_if(names(cur_exprs), ""), cur_exprs)
     ``` 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] jonkeane commented on a diff in pull request #12818: ARROW-16038: [R] different behavior from dplyr when mutate's `.keep` option is set

Posted by GitBox <gi...@apache.org>.
jonkeane commented on code in PR #12818:
URL: https://github.com/apache/arrow/pull/12818#discussion_r844432758


##########
r/R/dplyr-mutate.R:
##########
@@ -94,7 +94,8 @@ mutate.arrow_dplyr_query <- function(.data,
 
   # Respect .keep
   if (.keep == "none") {
-    .data$selected_columns <- .data$selected_columns[new_vars]
+    new_cols_last <- c(intersect(old_vars, new_vars), setdiff(new_vars, old_vars))
+    .data$selected_columns <- .data$selected_columns[new_cols_last]

Review Comment:
   It might be nice to have a comment explaining what we're doing here



##########
r/R/dplyr-mutate.R:
##########
@@ -112,7 +113,15 @@ mutate.Dataset <- mutate.ArrowTabular <- mutate.RecordBatchReader <- mutate.arro
 
 transmute.arrow_dplyr_query <- function(.data, ...) {
   dots <- check_transmute_args(...)
-  dplyr::mutate(.data, !!!dots, .keep = "none")
+  has_null <- vapply(dots, quo_is_null, logical(1))

Review Comment:
   This is rather stylistic, but I think `map_lgl(dots, quo_is_null)` would fit a bit more with that package style (I see quite a few more `purrr::map`s compared to `vapply`s)



##########
r/R/dplyr-mutate.R:
##########
@@ -112,7 +113,15 @@ mutate.Dataset <- mutate.ArrowTabular <- mutate.RecordBatchReader <- mutate.arro
 
 transmute.arrow_dplyr_query <- function(.data, ...) {
   dots <- check_transmute_args(...)
-  dplyr::mutate(.data, !!!dots, .keep = "none")
+  has_null <- vapply(dots, quo_is_null, logical(1))
+  .data <- dplyr::mutate(.data, !!!dots, .keep = "none")
+  if (is_empty(dots) | any(has_null)) return(.data)
+
+  ## keeping with: https://github.com/tidyverse/dplyr/issues/6086
+  cur_exprs <- vapply(dots, as_label, character(1))
+  new_col_names <- names(dots)
+  transmute_order <- ifelse(nzchar(new_col_names), new_col_names, cur_exprs)

Review Comment:
   I'm having a bit of trouble following what this line is doing here — what behavior are we trying to catch here?



##########
r/tests/testthat/test-dplyr-mutate.R:
##########
@@ -329,13 +339,13 @@ test_that("dplyr::mutate's examples", {
   #>   <chr> <chr> <dbl>
   #> 1 a     b         3
   compare_dplyr_binding(
-    .input %>% mutate(z = x + y, .keep = "none") %>% collect(), # same as transmute()
+    .input %>% mutate(z = x + y, x, .keep = "none") %>% collect(), # same as transmute()
     df
   )
-  #> # A tibble: 1 x 1
-  #>       z
-  #>   <dbl>
-  #> 1     3
+  #> # A tibble: 1 × 2
+  #>       x     z
+  #>   <dbl> <dbl>
+  #> 1     1     3

Review Comment:
   Thanks for updating all of this!



##########
r/R/dplyr-mutate.R:
##########
@@ -112,7 +113,15 @@ mutate.Dataset <- mutate.ArrowTabular <- mutate.RecordBatchReader <- mutate.arro
 
 transmute.arrow_dplyr_query <- function(.data, ...) {
   dots <- check_transmute_args(...)
-  dplyr::mutate(.data, !!!dots, .keep = "none")
+  has_null <- vapply(dots, quo_is_null, logical(1))
+  .data <- dplyr::mutate(.data, !!!dots, .keep = "none")
+  if (is_empty(dots) | any(has_null)) return(.data)

Review Comment:
   This is a style nit-pick, but could we put the `return(.data)` on it's own line? (cf https://style.tidyverse.org/functions.html#return)



##########
r/tests/testthat/test-dplyr-mutate.R:
##########
@@ -74,6 +74,16 @@ test_that("transmute", {
   )
 })
 
+test_that("transmute respect bespoke dplyr implementation", {
+  ## see: https://github.com/tidyverse/dplyr/issues/6086
+  compare_dplyr_binding(
+    .input %>%
+      transmute(dbl, int = int + 6L) %>%
+      collect(),
+    tbl
+  )
+})

Review Comment:
   Could we add a test with unnamed arguments to `transmute()` here as well?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] github-actions[bot] commented on pull request #12818: ARROW-16038: [R] different behavior from dplyr when mutate's `.keep` option is set

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on PR #12818:
URL: https://github.com/apache/arrow/pull/12818#issuecomment-1090776203

   :warning: Ticket **has not been started in JIRA**, please click 'Start Progress'.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] jonkeane commented on a diff in pull request #12818: ARROW-16038: [R] different behavior from dplyr when mutate's `.keep` option is set

Posted by GitBox <gi...@apache.org>.
jonkeane commented on code in PR #12818:
URL: https://github.com/apache/arrow/pull/12818#discussion_r845144322


##########
r/R/dplyr-mutate.R:
##########
@@ -112,7 +113,15 @@ mutate.Dataset <- mutate.ArrowTabular <- mutate.RecordBatchReader <- mutate.arro
 
 transmute.arrow_dplyr_query <- function(.data, ...) {
   dots <- check_transmute_args(...)
-  dplyr::mutate(.data, !!!dots, .keep = "none")
+  has_null <- vapply(dots, quo_is_null, logical(1))
+  .data <- dplyr::mutate(.data, !!!dots, .keep = "none")
+  if (is_empty(dots) | any(has_null)) return(.data)
+
+  ## keeping with: https://github.com/tidyverse/dplyr/issues/6086
+  cur_exprs <- vapply(dots, as_label, character(1))
+  new_col_names <- names(dots)
+  transmute_order <- ifelse(nzchar(new_col_names), new_col_names, cur_exprs)

Review Comment:
   _nods_ sorry I should have been a bit more specific — I didn't totally follow what might cause the names of `new_col_names` to be empty (I imagine that's what happens with any current columns at this point, given this `ifelse()`: so here `cur_exprs` has names if it's a new column, but does not if it doesn't, yeah?), but that isn't super super obvious on first read. And actually, now that I'm looking at it, maybe the intention is to replace any thing in `dots` that doesn't have a name with what's in `cur_expers`?
   
   Would it be possible to do something like:
   
   ```
   cur_exprs <- map_chr(dots, as_label)
   transmute_order <- names(dots)
   transmute_order[nzchar(transmute_order)] <- cur_exprs[nzchar(transmute_order)]
   ```
   
   or even something like `purrr::modify` or `purrr::modify_if`? https://purrr.tidyverse.org/reference/modify.html



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org