Posted to github@arrow.apache.org by "eitsupi (via GitHub)" <gi...@apache.org> on 2023/05/08 10:34:22 UTC

[GitHub] [arrow] eitsupi opened a new pull request, #35473: GH-35445: [R] Behavior something like group_by(foo) |> across(everything()) is different from dplyr

eitsupi opened a new pull request, #35473:
URL: https://github.com/apache/arrow/pull/35473

   ### Rationale for this change
   
   The argument `.cols` of the `dplyr::across` function has the following description.
   
   > You can't select grouping columns because they are already automatically handled by the verb (i.e. summarise() or mutate()).
   
   However, the `arrow` package does not currently reproduce this behavior: selecting the grouping column with `everything()` raises an error.
   
   ``` r
   mtcars |>
     arrow::as_arrow_table() |>
     dplyr::group_by(cyl) |>
     dplyr::summarise(dplyr::across(everything(), sum)) |>
     dplyr::collect()
   #> Error in `compute.arrow_dplyr_query()`:
   #> ! Invalid: Multiple matches for FieldRef.Name(cyl) in mpg: double
   #> cyl: double
   #> disp: double
   #> hp: double
   #> drat: double
   #> wt: double
   #> qsec: double
   #> vs: double
   #> am: double
   #> gear: double
   #> carb: double
   #> cyl: double
   #> Backtrace:
   #>     ▆
   #>  1. ├─dplyr::collect(...)
   #>  2. └─arrow:::collect.arrow_dplyr_query(...)
   #>  3.   └─arrow:::compute.arrow_dplyr_query(x)
   #>  4.     └─base::tryCatch(...)
   #>  5.       └─base (local) tryCatchList(expr, classes, parentenv, handlers)
   #>  6.         └─base (local) tryCatchOne(expr, names, parentenv, handlers[[1L]])
   #>  7.           └─value[[3L]](cond)
   #>  8.             └─arrow:::augment_io_error_msg(e, call, schema = schema())
   #>  9.               └─rlang::abort(msg, call = call)
   ```
   
   <sup>Created on 2023-05-05 with [reprex v2.0.2](https://reprex.tidyverse.org)</sup>
   
   This PR fixes this so that the behavior matches dplyr's.
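
   For comparison, a minimal sketch of the dplyr behavior being matched, run on a plain data frame with no arrow objects involved. It assumes only that `dplyr` is installed; the name `res` is illustrative and not part of the PR.

   ```r
   library(dplyr)

   # Plain data frame: dplyr leaves the grouping column `cyl` out of the
   # across() selection and keeps it as the group key in the result.
   res <- mtcars |>
     group_by(cyl) |>
     summarise(across(everything(), sum))

   # Inspect the resulting column names and number of groups.
   names(res)
   nrow(res)
   ```

   Here `cyl` appears once, as the group key, and is not passed to `sum()`.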
   
   ### What changes are included in this PR?
   
   - Automatically exclude grouping columns in `across` within `mutate`, `transmute`, and `summarise`.
   - `mutate`, `transmute`, `arrange`, and `filter` always return an `arrow_dplyr_query`.
     Currently, an `arrow_dplyr_query` is not returned in cases such as the following, which is inconsistent.
     ```r
     mtcars |> arrow::arrow_table() |> dplyr::mutate()
     ```
   - Correct the column order in the results of `group_by(foo) |> mutate(.keep = "none")`.
     Currently, the following query returns the grouping columns moved to the end, which differs from dplyr's behavior.
     ```r
     mtcars |> arrow::arrow_table() |> dplyr::group_by(cyl) |> dplyr::mutate(am, .keep = "none") |> dplyr::collect()
     ```
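
   As a reference point for the last item, a minimal sketch of the same query on a plain data frame, which shows the column order dplyr itself produces. It assumes only that `dplyr` is installed; the name `res_keep` is illustrative.

   ```r
   library(dplyr)

   # Reference behavior on a plain data frame (no arrow objects involved).
   # With .keep = "none", a grouped mutate() retains the grouping column
   # along with the columns named in the call.
   res_keep <- mtcars |>
     group_by(cyl) |>
     mutate(am, .keep = "none")

   # Inspect the column order and the grouping variable.
   names(res_keep)
   group_vars(res_keep)
   ```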
   
   ### Are these changes tested?
   
   Yes.
   
   ### Are there any user-facing changes?
   
   Yes.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] ursabot commented on pull request #35473: GH-35445: [R] Behavior something like group_by(foo) |> across(everything()) is different from dplyr

Posted by "ursabot (via GitHub)" <gi...@apache.org>.
ursabot commented on PR #35473:
URL: https://github.com/apache/arrow/pull/35473#issuecomment-1555996926

   Benchmark runs are scheduled for baseline = 3e4eaa917fa9b09a923d255adee520aa68a4e78c and contender = 6bd00508116edea5afcdc4e3e11cd9fa789b70a3. 6bd00508116edea5afcdc4e3e11cd9fa789b70a3 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
   Conbench compare runs links:
   [Finished :arrow_down:0.0% :arrow_up:0.0%] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/0fb5d12a13ab40438a3006581b42239a...814b1e69a9c0417e87304aed6f42dd1b/)
   [Finished :arrow_down:0.42% :arrow_up:0.0%] [test-mac-arm](https://conbench.ursa.dev/compare/runs/370b402801084fc2acdfd3caaf3f6579...0cbc622326074df4abc3aef4e7fc4946/)
   [Finished :arrow_down:1.31% :arrow_up:0.0%] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/4dc8cca243854c2889f480decb95e0ee...bfb82668901d406892ce069b3eb4b859/)
   [Finished :arrow_down:0.57% :arrow_up:0.03%] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/86c5130da89f4eba92d05466de6c7b2a...8343ad670a3a4134b5b55a181496b5a1/)
   Buildkite builds:
   [Finished] [`6bd00508` ec2-t3-xlarge-us-east-2](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/2895)
   [Finished] [`6bd00508` test-mac-arm](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/2931)
   [Finished] [`6bd00508` ursa-i9-9960x](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-i9-9960x/builds/2896)
   [Finished] [`6bd00508` ursa-thinkcentre-m75q](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/2921)
   [Finished] [`3e4eaa91` ec2-t3-xlarge-us-east-2](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/2894)
   [Finished] [`3e4eaa91` test-mac-arm](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/2930)
   [Finished] [`3e4eaa91` ursa-i9-9960x](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-i9-9960x/builds/2895)
   [Finished] [`3e4eaa91` ursa-thinkcentre-m75q](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/2920)
   Supported benchmarks:
   ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
   test-mac-arm: Supported benchmark langs: C++, Python, R
   ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
   ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
   




[GitHub] [arrow] thisisnic merged pull request #35473: GH-35445: [R] Behavior something like group_by(foo) |> across(everything()) is different from dplyr

Posted by "thisisnic (via GitHub)" <gi...@apache.org>.
thisisnic merged PR #35473:
URL: https://github.com/apache/arrow/pull/35473




[GitHub] [arrow] github-actions[bot] commented on pull request #35473: GH-35445: [R] Behavior something like group_by(foo) |> across(everything()) is different from dplyr

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on PR #35473:
URL: https://github.com/apache/arrow/pull/35473#issuecomment-1538148677

   * Closes: #35445




[GitHub] [arrow] eitsupi commented on pull request #35473: GH-35445: [R] Behavior something like group_by(foo) |> across(everything()) is different from dplyr

Posted by "eitsupi (via GitHub)" <gi...@apache.org>.
eitsupi commented on PR #35473:
URL: https://github.com/apache/arrow/pull/35473#issuecomment-1542384815

   In the process of updating the test, I noticed that `transmute` was not working correctly.
   I will fix it later.




[GitHub] [arrow] thisisnic commented on a diff in pull request #35473: GH-35445: [R] Behavior something like group_by(foo) |> across(everything()) is different from dplyr

Posted by "thisisnic (via GitHub)" <gi...@apache.org>.
thisisnic commented on code in PR #35473:
URL: https://github.com/apache/arrow/pull/35473#discussion_r1192930273


##########
r/tests/testthat/test-dplyr-mutate.R:
##########
@@ -652,3 +672,41 @@ test_that("Can use across() within transmute()", {
     example_data
   )
 })
+
+test_that("across() does not select grouping variables within mutate()", {
+  compare_dplyr_binding(
+    .input %>%
+      group_by(chr) %>%
+      mutate(across(everything(), round)) %>%
+      collect(),
+    example_data %>%
+      select(int, dbl, chr)

Review Comment:
   nit: it's more skimmable when the `tbl` parameter to `compare_dplyr_binding()` is just a table with no modification. Please could you move the `select()` to the `expr` parameter code instead?





[GitHub] [arrow] ursabot commented on pull request #35473: GH-35445: [R] Behavior something like group_by(foo) |> across(everything()) is different from dplyr

Posted by "ursabot (via GitHub)" <gi...@apache.org>.
ursabot commented on PR #35473:
URL: https://github.com/apache/arrow/pull/35473#issuecomment-1555998210

   ['Python', 'R'] benchmarks have high level of regressions.
   [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/4dc8cca243854c2889f480decb95e0ee...bfb82668901d406892ce069b3eb4b859/)
   




[GitHub] [arrow] github-actions[bot] commented on pull request #35473: GH-35445: [R] Behavior something like group_by(foo) |> across(everything()) is different from dplyr

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on PR #35473:
URL: https://github.com/apache/arrow/pull/35473#issuecomment-1538148737

   :warning: GitHub issue #35445 **has been automatically assigned in GitHub** to PR creator.




[GitHub] [arrow] eitsupi commented on pull request #35473: GH-35445: [R] Behavior something like group_by(foo) |> across(everything()) is different from dplyr

Posted by "eitsupi (via GitHub)" <gi...@apache.org>.
eitsupi commented on PR #35473:
URL: https://github.com/apache/arrow/pull/35473#issuecomment-1542437044

   I noticed that the column order is also wrong after `select`, but this is beyond the scope of this pull request, so I will create another issue.
   
   ```r
   > mtcars |> group_by(cyl) |> select(mpg) |> collect()
   Adding missing grouping variables: `cyl`
   # A tibble: 32 × 2
   # Groups:   cyl [3]
        cyl   mpg
      <dbl> <dbl>
    1     6  21  
    2     6  21  
    3     4  22.8
    4     6  21.4
    5     8  18.7
    6     6  18.1
    7     8  14.3
    8     4  24.4
    9     4  22.8
   10     6  19.2
   # … with 22 more rows
   # ℹ Use `print(n = ...)` to see more rows
   
   > mtcars |> arrow_table() |> group_by(cyl) |> select(mpg) |> collect()
   # A tibble: 32 × 2
   # Groups:   cyl [3]
        mpg   cyl
      <dbl> <dbl>
    1  21       6
    2  21       6
    3  22.8     4
    4  21.4     6
    5  18.7     8
    6  18.1     6
    7  14.3     8
    8  24.4     4
    9  22.8     4
   10  19.2     6
   # … with 22 more rows
   # ℹ Use `print(n = ...)` to see more rows
   ```

