You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/05/06 21:17:45 UTC

[GitHub] [arrow] thisisnic opened a new pull request #10261: ARROW-11769: [R] Pull groups from grouped_df into RecordBatch or Table

thisisnic opened a new pull request #10261:
URL: https://github.com/apache/arrow/pull/10261


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] ianmcook commented on a change in pull request #10261: ARROW-11769: [R] Pull groups from grouped_df into RecordBatch or Table

Posted by GitBox <gi...@apache.org>.
ianmcook commented on a change in pull request #10261:
URL: https://github.com/apache/arrow/pull/10261#discussion_r627898242



##########
File path: r/tests/testthat/test-RecordBatch.R
##########
@@ -15,7 +15,7 @@
 # specific language governing permissions and limitations
 # under the License.
 
-context("RecordBatch")
+library(dplyr)

Review comment:
       Restore the `context()` function here, and use `dplyr::` before `group_by()` below so you don't need to load dplyr here
   ```suggestion
   context("RecordBatch")
   ```




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] ianmcook commented on a change in pull request #10261: ARROW-11769: [R] Pull groups from grouped_df into RecordBatch or Table

Posted by GitBox <gi...@apache.org>.
ianmcook commented on a change in pull request #10261:
URL: https://github.com/apache/arrow/pull/10261#discussion_r627895271



##########
File path: r/R/record-batch.R
##########
@@ -148,13 +148,20 @@ RecordBatch$create <- function(..., schema = NULL) {
   if (length(arrays) == 1 && inherits(arrays[[1]], c("raw", "Buffer", "InputStream", "Message"))) {
     return(RecordBatch$from_message(arrays[[1]], schema))
   }
+  
   # Else, list of arrays
   # making sure there are always names
   if (is.null(names(arrays))) {
     names(arrays) <- rep_len("", length(arrays))
   }
   stopifnot(length(arrays) > 0)
 
+  # Preserve any grouping

Review comment:
       I think this new code should be moved to above `# Else, list of arrays` (but I'm not 100% sure)




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] ianmcook commented on a change in pull request #10261: ARROW-11769: [R] Pull groups from grouped_df into RecordBatch or Table

Posted by GitBox <gi...@apache.org>.
ianmcook commented on a change in pull request #10261:
URL: https://github.com/apache/arrow/pull/10261#discussion_r627895409



##########
File path: r/R/table.R
##########
@@ -168,6 +168,13 @@ Table$create <- function(..., schema = NULL) {
     names(dots) <- rep_len("", length(dots))
   }
   stopifnot(length(dots) > 0)
+  
+  # Preserve any grouping
+  if (length(dots) == 1 && inherits(dots[[1]], "grouped_df")) {
+    out <- Table__from_dots(dots, schema)
+    return(group_by(out, !!!groups(dots[[1]])))

Review comment:
       ```suggestion
       return(dplyr::group_by(out, !!!dplyr::groups(dots[[1]])))
   ```




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] ianmcook commented on pull request #10261: ARROW-11769: [R] Pull groups from grouped_df into RecordBatch or Table

Posted by GitBox <gi...@apache.org>.
ianmcook commented on pull request #10261:
URL: https://github.com/apache/arrow/pull/10261#issuecomment-834453078


   This looks great @thisisnic! I'll merge after the CI is green 👍 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] ianmcook commented on a change in pull request #10261: ARROW-11769: [R] Pull groups from grouped_df into RecordBatch or Table

Posted by GitBox <gi...@apache.org>.
ianmcook commented on a change in pull request #10261:
URL: https://github.com/apache/arrow/pull/10261#discussion_r627898887



##########
File path: r/tests/testthat/test-Table.R
##########
@@ -15,7 +15,7 @@
 # specific language governing permissions and limitations
 # under the License.
 
-context("Table")
+library(dplyr)

Review comment:
       ```suggestion
   ```




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] ianmcook commented on a change in pull request #10261: ARROW-11769: [R] Pull groups from grouped_df into RecordBatch or Table

Posted by GitBox <gi...@apache.org>.
ianmcook commented on a change in pull request #10261:
URL: https://github.com/apache/arrow/pull/10261#discussion_r628233978



##########
File path: r/R/record-batch.R
##########
@@ -148,13 +148,20 @@ RecordBatch$create <- function(..., schema = NULL) {
   if (length(arrays) == 1 && inherits(arrays[[1]], c("raw", "Buffer", "InputStream", "Message"))) {
     return(RecordBatch$from_message(arrays[[1]], schema))
   }
+  
   # Else, list of arrays

Review comment:
       ```suggestion
     # Else, a list of arrays or data.frames
   ```




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] ianmcook commented on a change in pull request #10261: ARROW-11769: [R] Pull groups from grouped_df into RecordBatch or Table

Posted by GitBox <gi...@apache.org>.
ianmcook commented on a change in pull request #10261:
URL: https://github.com/apache/arrow/pull/10261#discussion_r627895271



##########
File path: r/R/record-batch.R
##########
@@ -148,13 +148,20 @@ RecordBatch$create <- function(..., schema = NULL) {
   if (length(arrays) == 1 && inherits(arrays[[1]], c("raw", "Buffer", "InputStream", "Message"))) {
     return(RecordBatch$from_message(arrays[[1]], schema))
   }
+  
   # Else, list of arrays
   # making sure there are always names
   if (is.null(names(arrays))) {
     names(arrays) <- rep_len("", length(arrays))
   }
   stopifnot(length(arrays) > 0)
 
+  # Preserve any grouping

Review comment:
       I think this new code should be moved to above `# Else, list of arrays`




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] thisisnic commented on a change in pull request #10261: ARROW-11769: [R] Pull groups from grouped_df into RecordBatch or Table

Posted by GitBox <gi...@apache.org>.
thisisnic commented on a change in pull request #10261:
URL: https://github.com/apache/arrow/pull/10261#discussion_r627989698



##########
File path: r/R/record-batch.R
##########
@@ -148,13 +148,20 @@ RecordBatch$create <- function(..., schema = NULL) {
   if (length(arrays) == 1 && inherits(arrays[[1]], c("raw", "Buffer", "InputStream", "Message"))) {
     return(RecordBatch$from_message(arrays[[1]], schema))
   }
+  
   # Else, list of arrays
   # making sure there are always names
   if (is.null(names(arrays))) {
     names(arrays) <- rep_len("", length(arrays))
   }
   stopifnot(length(arrays) > 0)
 
+  # Preserve any grouping

Review comment:
       When I was testing it, it needed to be below that bit which gives `arrays` an empty character as a name instead of being `NULL`, otherwise the call to `RecordBatch__from_arrays` raises the following error: `Error in RecordBatch__from_arrays(schema, arrays) :   STRING_ELT() can only be applied to a 'character vector', not a 'NULL'`

##########
File path: r/R/record-batch.R
##########
@@ -148,13 +148,20 @@ RecordBatch$create <- function(..., schema = NULL) {
   if (length(arrays) == 1 && inherits(arrays[[1]], c("raw", "Buffer", "InputStream", "Message"))) {
     return(RecordBatch$from_message(arrays[[1]], schema))
   }
+  
   # Else, list of arrays
   # making sure there are always names
   if (is.null(names(arrays))) {
     names(arrays) <- rep_len("", length(arrays))
   }
   stopifnot(length(arrays) > 0)
 
+  # Preserve any grouping

Review comment:
       When I was testing it, it needed to be below that bit which gives `arrays` an empty character as a name instead of being `NULL`, otherwise the later call to `RecordBatch__from_arrays` raises the following error: `Error in RecordBatch__from_arrays(schema, arrays) :   STRING_ELT() can only be applied to a 'character vector', not a 'NULL'`




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] ianmcook commented on a change in pull request #10261: ARROW-11769: [R] Pull groups from grouped_df into RecordBatch or Table

Posted by GitBox <gi...@apache.org>.
ianmcook commented on a change in pull request #10261:
URL: https://github.com/apache/arrow/pull/10261#discussion_r627898242



##########
File path: r/tests/testthat/test-RecordBatch.R
##########
@@ -15,7 +15,7 @@
 # specific language governing permissions and limitations
 # under the License.
 
-context("RecordBatch")
+library(dplyr)

Review comment:
       ~~Restore the `context()` function here~~, and use `dplyr::` before `group_by()` and `group_vars()` below so you don't need to load dplyr here
   ```suggestion
   ```




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] ianmcook commented on a change in pull request #10261: ARROW-11769: [R] Pull groups from grouped_df into RecordBatch or Table

Posted by GitBox <gi...@apache.org>.
ianmcook commented on a change in pull request #10261:
URL: https://github.com/apache/arrow/pull/10261#discussion_r627893108



##########
File path: r/R/record-batch.R
##########
@@ -148,13 +148,20 @@ RecordBatch$create <- function(..., schema = NULL) {
   if (length(arrays) == 1 && inherits(arrays[[1]], c("raw", "Buffer", "InputStream", "Message"))) {
     return(RecordBatch$from_message(arrays[[1]], schema))
   }
+  
   # Else, list of arrays
   # making sure there are always names
   if (is.null(names(arrays))) {
     names(arrays) <- rep_len("", length(arrays))
   }
   stopifnot(length(arrays) > 0)
 
+  # Preserve any grouping
+  if (length(arrays) == 1 && inherits(arrays[[1]], "grouped_df")) {
+    out <- RecordBatch__from_arrays(schema, arrays)
+    return(group_by(out, !!!groups(arrays[[1]])))

Review comment:
       dplyr is not in Arrow's Depends or Imports, so we need to use `dplyr::` before `group_by` and other dplyr functions. This line will error if dplyr is not installed, but that's reasonable because it's enclosed in a conditional that should prevent it from being run in most cases where dplyr is not installed 
   ```suggestion
       return(dplyr::group_by(out, !!!groups(arrays[[1]])))
   ```




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] ianmcook commented on a change in pull request #10261: ARROW-11769: [R] Pull groups from grouped_df into RecordBatch or Table

Posted by GitBox <gi...@apache.org>.
ianmcook commented on a change in pull request #10261:
URL: https://github.com/apache/arrow/pull/10261#discussion_r628235553



##########
File path: r/R/record-batch.R
##########
@@ -148,13 +148,20 @@ RecordBatch$create <- function(..., schema = NULL) {
   if (length(arrays) == 1 && inherits(arrays[[1]], c("raw", "Buffer", "InputStream", "Message"))) {
     return(RecordBatch$from_message(arrays[[1]], schema))
   }
+  
   # Else, list of arrays
   # making sure there are always names
   if (is.null(names(arrays))) {
     names(arrays) <- rep_len("", length(arrays))
   }
   stopifnot(length(arrays) > 0)
 
+  # Preserve any grouping

Review comment:
       Oh ok, understood, thanks. This is a quirk of the `count_fields` function I guess. Thanks!




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] ianmcook commented on a change in pull request #10261: ARROW-11769: [R] Pull groups from grouped_df into RecordBatch or Table

Posted by GitBox <gi...@apache.org>.
ianmcook commented on a change in pull request #10261:
URL: https://github.com/apache/arrow/pull/10261#discussion_r627898326



##########
File path: r/tests/testthat/test-RecordBatch.R
##########
@@ -499,3 +499,21 @@ test_that("Handling string data with embedded nuls", {
     )
   })
 })
+
+test_that("ARROW-11769 - grouping preserved in record batch creation", {
+  
+  tbl <- tibble::tibble(
+    int = 1:10,
+    fct = factor(rep(c("A", "B"), 5)),
+    fct2 = factor(rep(c("C", "D"), each = 5)),
+  )
+  
+  expect_identical(
+    tbl %>%
+      group_by(fct, fct2) %>%

Review comment:
       ```suggestion
         dplyr::group_by(fct, fct2) %>%
   ```




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] ianmcook commented on a change in pull request #10261: ARROW-11769: [R] Pull groups from grouped_df into RecordBatch or Table

Posted by GitBox <gi...@apache.org>.
ianmcook commented on a change in pull request #10261:
URL: https://github.com/apache/arrow/pull/10261#discussion_r627898957



##########
File path: r/tests/testthat/test-Table.R
##########
@@ -475,3 +475,21 @@ test_that("Table$create() with different length columns", {
   expect_error(Table$create(a=1:5, b = 42), msg)
   expect_error(Table$create(a=1:5, b = 1:6), msg)
 })
+
+test_that("ARROW-11769 - grouping preserved in table creation", {
+  
+  tbl <- tibble::tibble(
+    int = 1:10,
+    fct = factor(rep(c("A", "B"), 5)),
+    fct2 = factor(rep(c("C", "D"), each = 5)),
+  )
+  
+  expect_identical(
+    tbl %>%
+      group_by(fct, fct2) %>%
+      Table$create() %>%
+      group_vars(),

Review comment:
       ```suggestion
         dplyr::group_by(fct, fct2) %>%
         Table$create() %>%
         dplyr::group_vars(),
   ```




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] ianmcook commented on a change in pull request #10261: ARROW-11769: [R] Pull groups from grouped_df into RecordBatch or Table

Posted by GitBox <gi...@apache.org>.
ianmcook commented on a change in pull request #10261:
URL: https://github.com/apache/arrow/pull/10261#discussion_r628235553



##########
File path: r/R/record-batch.R
##########
@@ -148,13 +148,20 @@ RecordBatch$create <- function(..., schema = NULL) {
   if (length(arrays) == 1 && inherits(arrays[[1]], c("raw", "Buffer", "InputStream", "Message"))) {
     return(RecordBatch$from_message(arrays[[1]], schema))
   }
+  
   # Else, list of arrays
   # making sure there are always names
   if (is.null(names(arrays))) {
     names(arrays) <- rep_len("", length(arrays))
   }
   stopifnot(length(arrays) > 0)
 
+  # Preserve any grouping

Review comment:
       Oh ok, understood, thanks. This is a quirk of the `count_fields` C++ function I guess. Thanks!




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] ianmcook commented on a change in pull request #10261: ARROW-11769: [R] Pull groups from grouped_df into RecordBatch or Table

Posted by GitBox <gi...@apache.org>.
ianmcook commented on a change in pull request #10261:
URL: https://github.com/apache/arrow/pull/10261#discussion_r627893108



##########
File path: r/R/record-batch.R
##########
@@ -148,13 +148,20 @@ RecordBatch$create <- function(..., schema = NULL) {
   if (length(arrays) == 1 && inherits(arrays[[1]], c("raw", "Buffer", "InputStream", "Message"))) {
     return(RecordBatch$from_message(arrays[[1]], schema))
   }
+  
   # Else, list of arrays
   # making sure there are always names
   if (is.null(names(arrays))) {
     names(arrays) <- rep_len("", length(arrays))
   }
   stopifnot(length(arrays) > 0)
 
+  # Preserve any grouping
+  if (length(arrays) == 1 && inherits(arrays[[1]], "grouped_df")) {
+    out <- RecordBatch__from_arrays(schema, arrays)
+    return(group_by(out, !!!groups(arrays[[1]])))

Review comment:
       dplyr is not in Arrow's Depends or Imports, so we need to use `dplyr::` before `group_by()`, `groups()`, and other dplyr functions. This line will error if dplyr is not installed, but that's reasonable because it's enclosed in a conditional that should prevent it from being run in most cases where dplyr is not installed 
   ```suggestion
       return(dplyr::group_by(out, !!!dplyr::groups(arrays[[1]])))
   ```




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] ianmcook commented on a change in pull request #10261: ARROW-11769: [R] Pull groups from grouped_df into RecordBatch or Table

Posted by GitBox <gi...@apache.org>.
ianmcook commented on a change in pull request #10261:
URL: https://github.com/apache/arrow/pull/10261#discussion_r628233978



##########
File path: r/R/record-batch.R
##########
@@ -148,13 +148,20 @@ RecordBatch$create <- function(..., schema = NULL) {
   if (length(arrays) == 1 && inherits(arrays[[1]], c("raw", "Buffer", "InputStream", "Message"))) {
     return(RecordBatch$from_message(arrays[[1]], schema))
   }
+  
   # Else, list of arrays

Review comment:
       ```suggestion
     # Else, a list of arrays or one or more data.frames
   ```




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] ianmcook commented on a change in pull request #10261: ARROW-11769: [R] Pull groups from grouped_df into RecordBatch or Table

Posted by GitBox <gi...@apache.org>.
ianmcook commented on a change in pull request #10261:
URL: https://github.com/apache/arrow/pull/10261#discussion_r627895409



##########
File path: r/R/table.R
##########
@@ -168,6 +168,13 @@ Table$create <- function(..., schema = NULL) {
     names(dots) <- rep_len("", length(dots))
   }
   stopifnot(length(dots) > 0)
+  
+  # Preserve any grouping
+  if (length(dots) == 1 && inherits(dots[[1]], "grouped_df")) {
+    out <- Table__from_dots(dots, schema)
+    return(group_by(out, !!!groups(dots[[1]])))

Review comment:
       ```suggestion
       return(dplyr::group_by(out, !!!groups(dots[[1]])))
   ```




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] ianmcook commented on a change in pull request #10261: ARROW-11769: [R] Pull groups from grouped_df into RecordBatch or Table

Posted by GitBox <gi...@apache.org>.
ianmcook commented on a change in pull request #10261:
URL: https://github.com/apache/arrow/pull/10261#discussion_r627893108



##########
File path: r/R/record-batch.R
##########
@@ -148,13 +148,20 @@ RecordBatch$create <- function(..., schema = NULL) {
   if (length(arrays) == 1 && inherits(arrays[[1]], c("raw", "Buffer", "InputStream", "Message"))) {
     return(RecordBatch$from_message(arrays[[1]], schema))
   }
+  
   # Else, list of arrays
   # making sure there are always names
   if (is.null(names(arrays))) {
     names(arrays) <- rep_len("", length(arrays))
   }
   stopifnot(length(arrays) > 0)
 
+  # Preserve any grouping
+  if (length(arrays) == 1 && inherits(arrays[[1]], "grouped_df")) {
+    out <- RecordBatch__from_arrays(schema, arrays)
+    return(group_by(out, !!!groups(arrays[[1]])))

Review comment:
       dplyr is not in Arrow's Depends or Imports, so we need to use `dplyr::` before `group_by()`, `groups()`, and other dplyr functions. 
   ```suggestion
       return(dplyr::group_by(out, !!!dplyr::groups(arrays[[1]])))
   ```
   This line will error if dplyr is not installed, but that's reasonable because it's enclosed in a conditional that should prevent it from being run in most cases where dplyr is not installed 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] thisisnic commented on a change in pull request #10261: ARROW-11769: [R] Pull groups from grouped_df into RecordBatch or Table

Posted by GitBox <gi...@apache.org>.
thisisnic commented on a change in pull request #10261:
URL: https://github.com/apache/arrow/pull/10261#discussion_r627992188



##########
File path: r/tests/testthat/test-RecordBatch.R
##########
@@ -15,7 +15,7 @@
 # specific language governing permissions and limitations
 # under the License.
 
-context("RecordBatch")
+library(dplyr)

Review comment:
       In PRs on other tests, I think some people have been removing `context` as it doesn't add anything useful to the CI output, so I've just been getting rid of it when I've been working with certain files.  Happy to add back in, but sounds like more discussion needed?
   
   Cool, won't import `dplyr` here.  




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] github-actions[bot] commented on pull request #10261: ARROW-11769: [R] Pull groups from grouped_df into RecordBatch or Table

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #10261:
URL: https://github.com/apache/arrow/pull/10261#issuecomment-833873728


   https://issues.apache.org/jira/browse/ARROW-11769


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] ianmcook commented on a change in pull request #10261: ARROW-11769: [R] Pull groups from grouped_df into RecordBatch or Table

Posted by GitBox <gi...@apache.org>.
ianmcook commented on a change in pull request #10261:
URL: https://github.com/apache/arrow/pull/10261#discussion_r627898887



##########
File path: r/tests/testthat/test-Table.R
##########
@@ -15,7 +15,7 @@
 # specific language governing permissions and limitations
 # under the License.
 
-context("Table")
+library(dplyr)

Review comment:
       ```suggestion
   context("Table")
   ```




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] ianmcook commented on a change in pull request #10261: ARROW-11769: [R] Pull groups from grouped_df into RecordBatch or Table

Posted by GitBox <gi...@apache.org>.
ianmcook commented on a change in pull request #10261:
URL: https://github.com/apache/arrow/pull/10261#discussion_r627893108



##########
File path: r/R/record-batch.R
##########
@@ -148,13 +148,20 @@ RecordBatch$create <- function(..., schema = NULL) {
   if (length(arrays) == 1 && inherits(arrays[[1]], c("raw", "Buffer", "InputStream", "Message"))) {
     return(RecordBatch$from_message(arrays[[1]], schema))
   }
+  
   # Else, list of arrays
   # making sure there are always names
   if (is.null(names(arrays))) {
     names(arrays) <- rep_len("", length(arrays))
   }
   stopifnot(length(arrays) > 0)
 
+  # Preserve any grouping
+  if (length(arrays) == 1 && inherits(arrays[[1]], "grouped_df")) {
+    out <- RecordBatch__from_arrays(schema, arrays)
+    return(group_by(out, !!!groups(arrays[[1]])))

Review comment:
       dplyr is not in Arrow's Depends or Imports, so we need to use `dplyr::` before `group_by` and other dplyr functions. This line will error if dplyr is not installed, but that's reasonable because it's enclosed in a conditional that should prevent it from being run in most cases where dplyr is not installed 
   ```suggestion
       return(dplyr::group_by(out, !!!dplyr::groups(arrays[[1]])))
   ```




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] ianmcook commented on a change in pull request #10261: ARROW-11769: [R] Pull groups from grouped_df into RecordBatch or Table

Posted by GitBox <gi...@apache.org>.
ianmcook commented on a change in pull request #10261:
URL: https://github.com/apache/arrow/pull/10261#discussion_r627898783



##########
File path: r/tests/testthat/test-RecordBatch.R
##########
@@ -499,3 +499,21 @@ test_that("Handling string data with embedded nuls", {
     )
   })
 })
+
+test_that("ARROW-11769 - grouping preserved in record batch creation", {
+  
+  tbl <- tibble::tibble(
+    int = 1:10,
+    fct = factor(rep(c("A", "B"), 5)),
+    fct2 = factor(rep(c("C", "D"), each = 5)),
+  )
+  
+  expect_identical(
+    tbl %>%
+      group_by(fct, fct2) %>%
+      record_batch() %>%
+      group_vars(),

Review comment:
       ```suggestion
         dplyr::group_by(fct, fct2) %>%
         record_batch() %>%
         dplyr::group_vars(),
   ```




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] ianmcook commented on a change in pull request #10261: ARROW-11769: [R] Pull groups from grouped_df into RecordBatch or Table

Posted by GitBox <gi...@apache.org>.
ianmcook commented on a change in pull request #10261:
URL: https://github.com/apache/arrow/pull/10261#discussion_r627898242



##########
File path: r/tests/testthat/test-RecordBatch.R
##########
@@ -15,7 +15,7 @@
 # specific language governing permissions and limitations
 # under the License.
 
-context("RecordBatch")
+library(dplyr)

Review comment:
       Restore the `context()` function here, and use `dplyr::` before `group_by()` and `group_vars()` below so you don't need to load dplyr here
   ```suggestion
   context("RecordBatch")
   ```




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] thisisnic commented on a change in pull request #10261: ARROW-11769: [R] Pull groups from grouped_df into RecordBatch or Table

Posted by GitBox <gi...@apache.org>.
thisisnic commented on a change in pull request #10261:
URL: https://github.com/apache/arrow/pull/10261#discussion_r627989698



##########
File path: r/R/record-batch.R
##########
@@ -148,13 +148,20 @@ RecordBatch$create <- function(..., schema = NULL) {
   if (length(arrays) == 1 && inherits(arrays[[1]], c("raw", "Buffer", "InputStream", "Message"))) {
     return(RecordBatch$from_message(arrays[[1]], schema))
   }
+  
   # Else, list of arrays
   # making sure there are always names
   if (is.null(names(arrays))) {
     names(arrays) <- rep_len("", length(arrays))
   }
   stopifnot(length(arrays) > 0)
 
+  # Preserve any grouping

Review comment:
       When I was testing it, it needed to be below that bit which gives it an empty character as a name instead of being `NULL`, otherwise the call to `RecordBatch__from_arrays` raises the following error: `Error in RecordBatch__from_arrays(schema, arrays) :   STRING_ELT() can only be applied to a 'character vector', not a 'NULL'`




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] thisisnic commented on a change in pull request #10261: ARROW-11769: [R] Pull groups from grouped_df into RecordBatch or Table

Posted by GitBox <gi...@apache.org>.
thisisnic commented on a change in pull request #10261:
URL: https://github.com/apache/arrow/pull/10261#discussion_r628229729



##########
File path: r/tests/testthat/test-RecordBatch.R
##########
@@ -15,7 +15,7 @@
 # specific language governing permissions and limitations
 # under the License.
 
-context("RecordBatch")
+library(dplyr)

Review comment:
       Thank you!




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] ianmcook commented on a change in pull request #10261: ARROW-11769: [R] Pull groups from grouped_df into RecordBatch or Table

Posted by GitBox <gi...@apache.org>.
ianmcook commented on a change in pull request #10261:
URL: https://github.com/apache/arrow/pull/10261#discussion_r627893108



##########
File path: r/R/record-batch.R
##########
@@ -148,13 +148,20 @@ RecordBatch$create <- function(..., schema = NULL) {
   if (length(arrays) == 1 && inherits(arrays[[1]], c("raw", "Buffer", "InputStream", "Message"))) {
     return(RecordBatch$from_message(arrays[[1]], schema))
   }
+  
   # Else, list of arrays
   # making sure there are always names
   if (is.null(names(arrays))) {
     names(arrays) <- rep_len("", length(arrays))
   }
   stopifnot(length(arrays) > 0)
 
+  # Preserve any grouping
+  if (length(arrays) == 1 && inherits(arrays[[1]], "grouped_df")) {
+    out <- RecordBatch__from_arrays(schema, arrays)
+    return(group_by(out, !!!groups(arrays[[1]])))

Review comment:
       dplyr is not in Arrow's Depends or Imports, so we need to use `dplyr::` before `group_by()`, `groups()`, and other dplyr functions
   ```suggestion
       return(dplyr::group_by(out, !!!dplyr::groups(arrays[[1]])))
   ```
   This line will error if dplyr is not installed, but that's reasonable because it's enclosed in a conditional that should prevent it from being run in most cases where dplyr is not installed 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] ianmcook commented on a change in pull request #10261: ARROW-11769: [R] Pull groups from grouped_df into RecordBatch or Table

Posted by GitBox <gi...@apache.org>.
ianmcook commented on a change in pull request #10261:
URL: https://github.com/apache/arrow/pull/10261#discussion_r628222255



##########
File path: r/tests/testthat/test-RecordBatch.R
##########
@@ -15,7 +15,7 @@
 # specific language governing permissions and limitations
 # under the License.
 
-context("RecordBatch")
+library(dplyr)

Review comment:
       Ah, understood, thanks. I modified these suggestions accordingly.

##########
File path: r/tests/testthat/test-RecordBatch.R
##########
@@ -15,7 +15,7 @@
 # specific language governing permissions and limitations
 # under the License.
 
-context("RecordBatch")
+library(dplyr)

Review comment:
       Restore the `context()` function here, and use `dplyr::` before `group_by()` and `group_vars()` below so you don't need to load dplyr here
   ```suggestion
   ```




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] ianmcook closed pull request #10261: ARROW-11769: [R] Pull groups from grouped_df into RecordBatch or Table

Posted by GitBox <gi...@apache.org>.
ianmcook closed pull request #10261:
URL: https://github.com/apache/arrow/pull/10261


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] ianmcook commented on a change in pull request #10261: ARROW-11769: [R] Pull groups from grouped_df into RecordBatch or Table

Posted by GitBox <gi...@apache.org>.
ianmcook commented on a change in pull request #10261:
URL: https://github.com/apache/arrow/pull/10261#discussion_r628233978



##########
File path: r/R/record-batch.R
##########
@@ -148,13 +148,20 @@ RecordBatch$create <- function(..., schema = NULL) {
   if (length(arrays) == 1 && inherits(arrays[[1]], c("raw", "Buffer", "InputStream", "Message"))) {
     return(RecordBatch$from_message(arrays[[1]], schema))
   }
+  
   # Else, list of arrays

Review comment:
       ```suggestion
     # Else, list of arrays or a data.frame
   ```

##########
File path: r/R/record-batch.R
##########
@@ -148,13 +148,20 @@ RecordBatch$create <- function(..., schema = NULL) {
   if (length(arrays) == 1 && inherits(arrays[[1]], c("raw", "Buffer", "InputStream", "Message"))) {
     return(RecordBatch$from_message(arrays[[1]], schema))
   }
+  
   # Else, list of arrays

Review comment:
       ```suggestion
     # Else, a list of arrays or a data.frame
   ```




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org