You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by "eitsupi (via GitHub)" <gi...@apache.org> on 2023/05/18 12:35:56 UTC

[GitHub] [arrow] eitsupi opened a new pull request, #35667: GH-33987: [R] Support new dplyr .by/by argument

eitsupi opened a new pull request, #35667:
URL: https://github.com/apache/arrow/pull/35667

   ### Rationale for this change
   
   Implement the `.by` argument for `mutate`, `summarise`, `filter` and `slice_*` family.
   
   ### What changes are included in this PR?
   
   The `.by` argument that matches `dplyr` has been added to some functions.
   
   Most of the internal functions, such as `compute_by`, are copied from the existing `dplyr` backends, `dbplyr` and `dtplyr`.
   
   ### Are these changes tested?
   
   Yes.
   
   ### Are there any user-facing changes?
   
   Yes.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] thisisnic commented on a diff in pull request #35667: GH-33987: [R] Support new dplyr .by/by argument

Posted by "thisisnic (via GitHub)" <gi...@apache.org>.
thisisnic commented on code in PR #35667:
URL: https://github.com/apache/arrow/pull/35667#discussion_r1211990586


##########
r/tests/testthat/test-dplyr-filter.R:
##########
@@ -425,3 +425,34 @@ test_that("filter() with across()", {
     tbl
   )
 })
+
+test_that(".by argument", {
+  compare_dplyr_binding(
+    .input %>%
+      filter(is.na(lgl), .by = chr) %>%
+      select(chr, int, lgl) %>%
+      collect(),
+    tbl
+  )
+  compare_dplyr_binding(
+    .input %>%
+      filter(.by = chr) %>%
+      select(chr, int, lgl) %>%
+      collect(),
+    tbl
+  )
+  compare_dplyr_binding(
+    .input %>%
+      filter(int > 2, pnorm(dbl) > .99, .by = chr) %>%
+      collect(),
+    tbl,
+    warning = "Expression pnorm\\(dbl\\) > 0.99 not supported in Arrow; pulling data into R"
+  )

Review Comment:
   Great, thanks for clarifying that!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] eitsupi commented on pull request #35667: GH-33987: [R] Support new dplyr .by/by argument

Posted by "eitsupi (via GitHub)" <gi...@apache.org>.
eitsupi commented on PR #35667:
URL: https://github.com/apache/arrow/pull/35667#issuecomment-1579576145

   Are there any plans for merging? (I am just worried this has been forgotten)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] eitsupi commented on pull request #35667: GH-33987: [R] Support new dplyr .by/by argument

Posted by "eitsupi (via GitHub)" <gi...@apache.org>.
eitsupi commented on PR #35667:
URL: https://github.com/apache/arrow/pull/35667#issuecomment-1580314482

   Thanks for merging!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] thisisnic merged pull request #35667: GH-33987: [R] Support new dplyr .by/by argument

Posted by "thisisnic (via GitHub)" <gi...@apache.org>.
thisisnic merged PR #35667:
URL: https://github.com/apache/arrow/pull/35667


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] thisisnic commented on pull request #35667: GH-33987: [R] Support new dplyr .by/by argument

Posted by "thisisnic (via GitHub)" <gi...@apache.org>.
thisisnic commented on PR #35667:
URL: https://github.com/apache/arrow/pull/35667#issuecomment-1563987875

   @eitsupi - if you give this PR a rebase to fix the CI failures, I'll take a look at this


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] github-actions[bot] commented on pull request #35667: GH-33987: [R] Support new dplyr .by/by argument

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on PR #35667:
URL: https://github.com/apache/arrow/pull/35667#issuecomment-1552988300

   :warning: GitHub issue #33987 **has been automatically assigned in GitHub** to PR creator.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] eitsupi commented on pull request #35667: GH-33987: [R] Support new dplyr .by/by argument

Posted by "eitsupi (via GitHub)" <gi...@apache.org>.
eitsupi commented on PR #35667:
URL: https://github.com/apache/arrow/pull/35667#issuecomment-1570313958

   > Please could you add or update some of the tests to use more than 1 grouping variable here?
   
   I think I did it, but these are pretty much the same test cases. I think it is better to parameterize these tests using `patrick`.
   I created an issue for that #35844.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] github-actions[bot] commented on pull request #35667: GH-33987: [R] Support new dplyr .by/by argument

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on PR #35667:
URL: https://github.com/apache/arrow/pull/35667#issuecomment-1552988244

   * Closes: #33987


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] thisisnic commented on a diff in pull request #35667: GH-33987: [R] Support new dplyr .by/by argument

Posted by "thisisnic (via GitHub)" <gi...@apache.org>.
thisisnic commented on code in PR #35667:
URL: https://github.com/apache/arrow/pull/35667#discussion_r1211606915


##########
r/tests/testthat/test-dplyr-filter.R:
##########
@@ -425,3 +425,34 @@ test_that("filter() with across()", {
     tbl
   )
 })
+
+test_that(".by argument", {
+  compare_dplyr_binding(
+    .input %>%
+      filter(is.na(lgl), .by = chr) %>%
+      select(chr, int, lgl) %>%
+      collect(),
+    tbl
+  )
+  compare_dplyr_binding(
+    .input %>%
+      filter(.by = chr) %>%
+      select(chr, int, lgl) %>%
+      collect(),
+    tbl
+  )
+  compare_dplyr_binding(
+    .input %>%
+      filter(int > 2, pnorm(dbl) > .99, .by = chr) %>%
+      collect(),
+    tbl,
+    warning = "Expression pnorm\\(dbl\\) > 0.99 not supported in Arrow; pulling data into R"
+  )

Review Comment:
   This test is pulling the data into R, so we end up comparing dplyr with dplyr instead of Arrow.  Is the purpose of this test to test with multiple filters? If so, how about swapping `pnorm()` out for something else?  Or is the intention here different?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] eitsupi commented on pull request #35667: GH-33987: [R] Support new dplyr .by/by argument

Posted by "eitsupi (via GitHub)" <gi...@apache.org>.
eitsupi commented on PR #35667:
URL: https://github.com/apache/arrow/pull/35667#issuecomment-1565935235

   > @eitsupi - if you give this PR a rebase to fix the CI failures, I'll take a look at this
   
   Done.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] thisisnic commented on pull request #35667: GH-33987: [R] Support new dplyr .by/by argument

Posted by "thisisnic (via GitHub)" <gi...@apache.org>.
thisisnic commented on PR #35667:
URL: https://github.com/apache/arrow/pull/35667#issuecomment-1580180981

   Thanks for the reminder @eitsupi!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] ursabot commented on pull request #35667: GH-33987: [R] Support new dplyr .by/by argument

Posted by "ursabot (via GitHub)" <gi...@apache.org>.
ursabot commented on PR #35667:
URL: https://github.com/apache/arrow/pull/35667#issuecomment-1581972140

   Benchmark runs are scheduled for baseline = c62ce6b179f67ff57fa1571b01f356739727c0c9 and contender = a0d28deefd660adc6b14c4ac5fe4ed4175d4fedb. a0d28deefd660adc6b14c4ac5fe4ed4175d4fedb is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
   Conbench compare runs links:
   [Finished :arrow_down:0.0% :arrow_up:0.0%] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/c34492d03c0d4af2a64849bbb0384b52...6f75f09fb7ac4bc28af5f9466249038f/)
   [Finished :arrow_down:0.71% :arrow_up:0.06%] [test-mac-arm](https://conbench.ursa.dev/compare/runs/5e5afce2fb204d74ab38e7a1ef08d704...14d42f70a65744498f4d5e602e00584d/)
   [Finished :arrow_down:0.33% :arrow_up:0.0%] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/6ec002e0d47c448a866df7f7a09dc5d0...04eb720e6aea48348e7ef3d4698effe2/)
   [Finished :arrow_down:0.45% :arrow_up:0.03%] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/4b4ae4a5444c4f95a27e477b4a1068a8...cf879a6424fb45349e3485f2281db9d5/)
   Buildkite builds:
   [Finished] [`a0d28dee` ec2-t3-xlarge-us-east-2](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/2996)
   [Finished] [`a0d28dee` test-mac-arm](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/3032)
   [Finished] [`a0d28dee` ursa-i9-9960x](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-i9-9960x/builds/2997)
   [Finished] [`a0d28dee` ursa-thinkcentre-m75q](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/3022)
   [Finished] [`c62ce6b1` ec2-t3-xlarge-us-east-2](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/2995)
   [Finished] [`c62ce6b1` test-mac-arm](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/3031)
   [Finished] [`c62ce6b1` ursa-i9-9960x](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-i9-9960x/builds/2996)
   [Finished] [`c62ce6b1` ursa-thinkcentre-m75q](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/3021)
   Supported benchmarks:
   ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
   test-mac-arm: Supported benchmark langs: C++, Python, R
   ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
   ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] eitsupi commented on a diff in pull request #35667: GH-33987: [R] Support new dplyr .by/by argument

Posted by "eitsupi (via GitHub)" <gi...@apache.org>.
eitsupi commented on code in PR #35667:
URL: https://github.com/apache/arrow/pull/35667#discussion_r1211733586


##########
r/tests/testthat/test-dplyr-filter.R:
##########
@@ -425,3 +425,34 @@ test_that("filter() with across()", {
     tbl
   )
 })
+
+test_that(".by argument", {
+  compare_dplyr_binding(
+    .input %>%
+      filter(is.na(lgl), .by = chr) %>%
+      select(chr, int, lgl) %>%
+      collect(),
+    tbl
+  )
+  compare_dplyr_binding(
+    .input %>%
+      filter(.by = chr) %>%
+      select(chr, int, lgl) %>%
+      collect(),
+    tbl
+  )
+  compare_dplyr_binding(
+    .input %>%
+      filter(int > 2, pnorm(dbl) > .99, .by = chr) %>%
+      collect(),
+    tbl,
+    warning = "Expression pnorm\\(dbl\\) > 0.99 not supported in Arrow; pulling data into R"
+  )

Review Comment:
   This is an intentional test to ensure that no grouping occurs during conversion to data.frame.
   Added a comment.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] thisisnic commented on a diff in pull request #35667: GH-33987: [R] Support new dplyr .by/by argument

Posted by "thisisnic (via GitHub)" <gi...@apache.org>.
thisisnic commented on code in PR #35667:
URL: https://github.com/apache/arrow/pull/35667#discussion_r1211580456


##########
r/R/dplyr-by.R:
##########
@@ -0,0 +1,77 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+compute_by <- function(by, data, ..., by_arg = "by", data_arg = "data", error_call = caller_env()) {
+  check_dots_empty0(...)
+
+  by <- enquo(by)
+  check_by(by, data, by_arg = by_arg, data_arg = data_arg, error_call = error_call)
+
+  if (is_grouped_adq(data)) {
+    names <- data$group_by_vars
+    from_by <- FALSE
+  } else {
+    names <- eval_select_by(by, data, error_call = error_call)
+    from_by <- TRUE
+  }
+
+  new_by(from_by = from_by, names = names)
+}
+
+is_grouped_adq <- function(data) {
+  !is_empty(data$group_by_vars)
+}
+
+check_by <- function(by,
+                     data,
+                     ...,
+                     by_arg = "by",
+                     data_arg = "data",
+                     error_call = caller_env()) {
+  check_dots_empty0(...)
+
+  if (quo_is_null(by)) {
+    return(invisible(NULL))
+  }
+
+  if (is_grouped_adq(data)) {
+    message <- paste0(
+      "Can't supply `", by_arg, "` when `",
+      data_arg, "` is a grouped arrow dplyr query."
+    )

Review Comment:
   The end user isn't necessarily aware of what a "grouped arrow dplyr query" is.  It's great that you're trying to mirror the original dplyr error message, but I think we might want to consider doing something else here.  Maybe "Can't supply `.by` on grouped data"?  Or even diverging from the dplyr messaging and going with something like "`.by` can only be used on ungrouped data".



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org