You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/03/29 23:08:00 UTC

[GitHub] [arrow] wjones127 opened a new pull request #12751: ARROW-15989: rbind & cbind for Table & RecordBatch

wjones127 opened a new pull request #12751:
URL: https://github.com/apache/arrow/pull/12751


   ## Summary of Changes
   
    * Added `rbind` and `cbind` for Table
    * Added `cbind` for RecordBatch. `rbind` just redirects the user to use `Table$create`
    * Changed `c.Array()` implementation to use either `concat_array()` or `ChunkedArray$create()` depending on whether the user wants a single array or zero-copy.
    * Implemented `c.ChunkedArray`


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] github-actions[bot] commented on pull request #12751: ARROW-15989: rbind & cbind for Table & RecordBatch

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #12751:
URL: https://github.com/apache/arrow/pull/12751#issuecomment-1082454916


   https://issues.apache.org/jira/browse/ARROW-15989


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] wjones127 commented on pull request #12751: ARROW-15989: [R] rbind & cbind for Table & RecordBatch

Posted by GitBox <gi...@apache.org>.
wjones127 commented on pull request #12751:
URL: https://github.com/apache/arrow/pull/12751#issuecomment-1085070078


   > But it is _really_ useful and difficult to replicate if there's a good way to do it here!
   
   I 100% agree. However, there seems to be an issue on the C++ side blocking this. I've created a new issue to tackle that. https://issues.apache.org/jira/browse/ARROW-16085


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] wjones127 commented on a change in pull request #12751: ARROW-15989: [R] rbind & cbind for Table & RecordBatch

Posted by GitBox <gi...@apache.org>.
wjones127 commented on a change in pull request #12751:
URL: https://github.com/apache/arrow/pull/12751#discussion_r839942127



##########
File path: r/R/table.R
##########
@@ -149,6 +149,64 @@ Table$create <- function(..., schema = NULL) {
 #' @export
 names.Table <- function(x) x$ColumnNames()
 
+#' @export
+rbind.Table <- function(...) {
+  tables <- list(...)
+
+  # assert they have same schema
+  schema <- tables[[1]]$schema
+  unequal_schema_idx <- which.min(lapply(tables, function(x) x$schema == schema))
+  if (unequal_schema_idx != 1) {
+    stop(paste0(
+      sprintf("Schema at index %i does not match the first schema\n", unequal_schema_idx),
+      "Schema 1:\n",
+      tables[[1]]$schema$ToString(),
+      sprintf("\nSchema %i:\n", unequal_schema_idx),
+      tables[[unequal_schema_idx]]$schema$ToString()
+    ))
+  }
+
+  # create chunked array from each column
+  columns <- vector(mode = "list", length = tables[[1]]$num_columns)
+  for (i in seq_len(length(columns))) {
+    columns[[i]] <- do.call(c, lapply(tables, function(tab) tab[[i]]))
+  }
+
+  # return new table
+  args <- columns
+  names(args) <- names(schema)
+  args$schema <- schema
+  do.call(Table$create, args)
+}
+
+#' @export
+cbind.Table <- function(...) {
+  tables <- list(...)
+
+  # Assert they have the same length
+  unequal_length_idx <- which.min(lapply(tables, function(x) x$num_rows == tables[[1]]$num_rows))

Review comment:
       Tried it, and TBH I think I prefer the error as is. The vctrs error just lacks helpful context:
   
   ```
   Error in `stop_vctrs(message, class = c(class, "vctrs_error_incompatible"), 
       x = x, y = y, details = details, ...)`: Can't recycle `..1` (size 10) to match `..2` (size 2).
   ```




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] wjones127 commented on pull request #12751: ARROW-15989: [R] rbind & cbind for Table & RecordBatch

Posted by GitBox <gi...@apache.org>.
wjones127 commented on pull request #12751:
URL: https://github.com/apache/arrow/pull/12751#issuecomment-1086109154


   Seems that S3 dispatch isn't behaving nicely in R 3.6 for `cbind.Table`:
   
   ```
   > do.call(cbind.Table, inputs)
   Table
   2 rows x 5 columns
   $z <int32>
   $a <int32>
   $b <int32>
   $c <dictionary<values=string, indices=int8>>
   $d <int32>
   > do.call(cbind, inputs)
             b c d
    [1,] ? ? ? 1 1
    [2,] ? ? ? 2 1
    [3,] ? ? ? 1 1
    [4,] ? ? ? 2 1
    [5,] ? ? ? 1 1
    [6,] ? ? ? 2 1
    [7,] ? ? ? 1 1
    [8,] ? ? ? 2 1
    [9,] ? ? ? 1 1
   [10,] ? ? ? 2 1
   [11,] ? ? ? 1 1
   [12,] ? ? ? 2 1
   [13,] ? ? ? 1 1
   [14,] ? ? ? 2 1
   [15,] ? ? ? 1 1
   [16,] ? ? ? 2 1
   [17,] ? ? ? 1 1
   [18,] ? ? ? 2 1
   [19,] ? ? ? 1 1
   [20,] ? ? ? 2 1
   [21,] ? ? ? 1 1
   [22,] ? ? ? 2 1
   [23,] ? ? ? 1 1
   [24,] ? ? ? 2 1
   [25,] ? ? ? 1 1
   [26,] ? ? ? 2 1
   [27,] ? ? ? 1 1
   [28,] ? ? ? 2 1
   [29,] ? ? ? 1 1
   [30,] ? ? ? 2 1
   [31,] ? ? ? 1 1
   [32,] ? ? ? 2 1
   [33,] ? ? ? 1 1
   [34,] ? ? ? 2 1
   [35,] ? ? ? 1 1
   [36,] ? ? ? 2 1
   Warning message:
   In (function (..., deparse.level = 1)  :
     number of rows of result is not a multiple of vector length (arg 2)
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] wjones127 commented on a change in pull request #12751: ARROW-15989: [R] rbind & cbind for Table & RecordBatch

Posted by GitBox <gi...@apache.org>.
wjones127 commented on a change in pull request #12751:
URL: https://github.com/apache/arrow/pull/12751#discussion_r839930496



##########
File path: r/R/table.R
##########
@@ -149,6 +149,64 @@ Table$create <- function(..., schema = NULL) {
 #' @export
 names.Table <- function(x) x$ColumnNames()
 
+#' @export
+rbind.Table <- function(...) {
+  tables <- list(...)
+
+  # assert they have same schema
+  schema <- tables[[1]]$schema
+  unequal_schema_idx <- which.min(lapply(tables, function(x) x$schema == schema))
+  if (unequal_schema_idx != 1) {
+    stop(paste0(
+      sprintf("Schema at index %i does not match the first schema\n", unequal_schema_idx),
+      "Schema 1:\n",
+      tables[[1]]$schema$ToString(),
+      sprintf("\nSchema %i:\n", unequal_schema_idx),
+      tables[[unequal_schema_idx]]$schema$ToString()
+    ))
+  }
+
+  # create chunked array from each column
+  columns <- vector(mode = "list", length = tables[[1]]$num_columns)
+  for (i in seq_len(length(columns))) {
+    columns[[i]] <- do.call(c, lapply(tables, function(tab) tab[[i]]))
+  }
+
+  # return new table
+  args <- columns
+  names(args) <- names(schema)
+  args$schema <- schema
+  do.call(Table$create, args)
+}
+
+#' @export
+cbind.Table <- function(...) {
+  tables <- list(...)
+
+  # Assert they have the same length
+  unequal_length_idx <- which.min(lapply(tables, function(x) x$num_rows == tables[[1]]$num_rows))

Review comment:
       🤦  Ah right. 
   
   And this allocation thing is because `vec_size_common` wants to convert into data frames, right?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] paleolimbot commented on a change in pull request #12751: ARROW-15989: [R] rbind & cbind for Table & RecordBatch

Posted by GitBox <gi...@apache.org>.
paleolimbot commented on a change in pull request #12751:
URL: https://github.com/apache/arrow/pull/12751#discussion_r839098244



##########
File path: r/R/chunked-array.R
##########
@@ -148,6 +153,12 @@ ChunkedArray$create <- function(..., type = NULL) {
   ChunkedArray__from_list(list2(...), type)
 }
 
+#' @export
+c.ChunkedArray <- function(...) {
+  arrays <- list(...)
+  do.call(ChunkedArray$create, unlist(lapply(arrays, function(arr) arr$chunks)))

Review comment:
       - Maybe `chunked_arrays` instead of `arrays`? (Given that both appear here and the difference matters)
   - If you can rig it so that you do `ChunkedArrays$create(!!! list_of_arrays)` it might be cleaner (and might give a nicer error message when it fails)
   - What happens when the arrays have different types? I think I rigged something in `concat_arrays()` so that you can specify the final type which might be worth adopting here too.

##########
File path: r/R/record-batch.R
##########
@@ -189,3 +189,36 @@ record_batch <- RecordBatch$create
 
 #' @export
 names.RecordBatch <- function(x) x$names()
+
+#' @export
+rbind.RecordBatch <- function(...) {
+  stop("Use Table$create to combine record batches")
+}
+
+#' @export
+cbind.RecordBatch <- function(...) {
+  batches <- list(...)
+
+  # Assert they have the same length
+  unequal_length_idx <- which.min(lapply(batches, function(x) x$num_rows == batches[[1]]$num_rows))

Review comment:
       See the note below about maybe using `vctrs::vec_size_common()` here

##########
File path: r/R/array.R
##########
@@ -256,7 +255,7 @@ concat_arrays <- function(..., type = NULL) {
 #' @rdname concat_arrays
 #' @export
 c.Array <- function(...) {
-  concat_arrays(...)
+  stop("Use concat_arrays() to create a new array, ChunkedArray$create() to create a zero-copy concatenation")

Review comment:
       It's not perfect, but I think we're generally trying to converge around `abort()` for error messages.
   
   ```suggestion
     abort("Use concat_arrays() to create a new array, ChunkedArray$create() to create a zero-copy concatenation")
   ```

##########
File path: r/tests/testthat/test-Array.R
##########
@@ -1035,30 +1035,6 @@ test_that("concat_arrays() coerces its input to Array", {
   )
 })
 
-test_that("c() works for Array", {

Review comment:
       You should probably test the error here (`expect_snapshot(..., error = TRUE)` or `expect_error()`)

##########
File path: r/R/table.R
##########
@@ -149,6 +149,64 @@ Table$create <- function(..., schema = NULL) {
 #' @export
 names.Table <- function(x) x$ColumnNames()
 
+#' @export
+rbind.Table <- function(...) {
+  tables <- list(...)
+
+  # assert they have same schema
+  schema <- tables[[1]]$schema
+  unequal_schema_idx <- which.min(lapply(tables, function(x) x$schema == schema))
+  if (unequal_schema_idx != 1) {
+    stop(paste0(
+      sprintf("Schema at index %i does not match the first schema\n", unequal_schema_idx),
+      "Schema 1:\n",
+      tables[[1]]$schema$ToString(),
+      sprintf("\nSchema %i:\n", unequal_schema_idx),
+      tables[[unequal_schema_idx]]$schema$ToString()
+    ))
+  }
+
+  # create chunked array from each column
+  columns <- vector(mode = "list", length = tables[[1]]$num_columns)
+  for (i in seq_len(length(columns))) {
+    columns[[i]] <- do.call(c, lapply(tables, function(tab) tab[[i]]))
+  }
+
+  # return new table
+  args <- columns
+  names(args) <- names(schema)
+  args$schema <- schema
+  do.call(Table$create, args)
+}
+
+#' @export
+cbind.Table <- function(...) {
+  tables <- list(...)
+
+  # Assert they have the same length
+  unequal_length_idx <- which.min(lapply(tables, function(x) x$num_rows == tables[[1]]$num_rows))

Review comment:
       You could use:
   
   ``` r
   vctrs::vec_size_common(seq_len(5), seq_len(10))
   #> Error in `stop_vctrs()`:
   #> ! Can't recycle `..1` (size 5) to match `..2` (size 10).
   ```
   
   ...with the slight complication that you're not actually recycling Tables with 1 row to match the length of the data. Maybe it's worth considering that (it's what `dplyr::bind_cols()`, which is powered by `vctrs::vec_cbind()` does).
   
   

##########
File path: r/R/table.R
##########
@@ -149,6 +149,64 @@ Table$create <- function(..., schema = NULL) {
 #' @export
 names.Table <- function(x) x$ColumnNames()
 
+#' @export
+rbind.Table <- function(...) {
+  tables <- list(...)
+
+  # assert they have same schema
+  schema <- tables[[1]]$schema

Review comment:
       If we use `UnionDataset` + `InMemoryDataset` do we have a better chance of combining tables with different schemas (at least filling missing columns with nulls and/or reconciling schemas that are equal except for the order of the columns?).

##########
File path: r/tests/testthat/test-RecordBatch.R
##########
@@ -513,6 +513,26 @@ test_that("record_batch() with different length arrays", {
   expect_error(record_batch(a = 1:5, b = 1:6), msg)
 })
 
+test_that("RecordBatch supports cbind", {
+  expect_error(
+    cbind(
+      RecordBatch$create(a = 1:10, ),
+      RecordBatch$create(a = c("a", "b"))
+    ),
+    regexp = "unequal number of rows"
+  )
+
+  batches <- list(
+    RecordBatch$create(a = c(1, 2), b = c("a", "b")),
+    RecordBatch$create(a = c("d", "c")),
+    RecordBatch$create(c = c(2, 3))
+  )
+
+  expected <- RecordBatch$create(
+    do.call(cbind, lapply(batches, function(batch) as.data.frame(batch))))
+  expect_equal(do.call(cbind, batches), expected, ignore_attr = TRUE)
+})
+

Review comment:
       ...and test the error for `rbind()` too

##########
File path: r/R/record-batch.R
##########
@@ -189,3 +189,36 @@ record_batch <- RecordBatch$create
 
 #' @export
 names.RecordBatch <- function(x) x$names()
+
+#' @export
+rbind.RecordBatch <- function(...) {
+  stop("Use Table$create to combine record batches")

Review comment:
       ```suggestion
     abort("Use Table$create to combine record batches")
   ```

##########
File path: r/R/record-batch.R
##########
@@ -189,3 +189,36 @@ record_batch <- RecordBatch$create
 
 #' @export
 names.RecordBatch <- function(x) x$names()
+
+#' @export
+rbind.RecordBatch <- function(...) {
+  stop("Use Table$create to combine record batches")
+}
+
+#' @export
+cbind.RecordBatch <- function(...) {
+  batches <- list(...)
+
+  # Assert they have the same length
+  unequal_length_idx <- which.min(lapply(batches, function(x) x$num_rows == batches[[1]]$num_rows))
+  if (unequal_length_idx != 1) {
+    stop(

Review comment:
       ```suggestion
       abort(
   ```




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] wjones127 commented on a change in pull request #12751: ARROW-15989: [R] rbind & cbind for Table & RecordBatch

Posted by GitBox <gi...@apache.org>.
wjones127 commented on a change in pull request #12751:
URL: https://github.com/apache/arrow/pull/12751#discussion_r839832203



##########
File path: r/R/table.R
##########
@@ -149,6 +149,64 @@ Table$create <- function(..., schema = NULL) {
 #' @export
 names.Table <- function(x) x$ColumnNames()
 
+#' @export
+rbind.Table <- function(...) {
+  tables <- list(...)
+
+  # assert they have same schema
+  schema <- tables[[1]]$schema
+  unequal_schema_idx <- which.min(lapply(tables, function(x) x$schema == schema))
+  if (unequal_schema_idx != 1) {
+    stop(paste0(
+      sprintf("Schema at index %i does not match the first schema\n", unequal_schema_idx),
+      "Schema 1:\n",
+      tables[[1]]$schema$ToString(),
+      sprintf("\nSchema %i:\n", unequal_schema_idx),
+      tables[[unequal_schema_idx]]$schema$ToString()
+    ))
+  }
+
+  # create chunked array from each column
+  columns <- vector(mode = "list", length = tables[[1]]$num_columns)
+  for (i in seq_len(length(columns))) {
+    columns[[i]] <- do.call(c, lapply(tables, function(tab) tab[[i]]))
+  }
+
+  # return new table
+  args <- columns
+  names(args) <- names(schema)
+  args$schema <- schema
+  do.call(Table$create, args)
+}
+
+#' @export
+cbind.Table <- function(...) {
+  tables <- list(...)
+
+  # Assert they have the same length
+  unequal_length_idx <- which.min(lapply(tables, function(x) x$num_rows == tables[[1]]$num_rows))

Review comment:
       hmmm not sure how to use this function. This isn't quite what we want:
   
   ```r
   > tables <- list(Table$create(a = 1:10), Table$create(a = c("a", "b")))
   > vctrs::vec_size_common(tables)
   [1] 2
   ```




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] paleolimbot commented on pull request #12751: ARROW-15989: [R] rbind & cbind for Table & RecordBatch

Posted by GitBox <gi...@apache.org>.
paleolimbot commented on pull request #12751:
URL: https://github.com/apache/arrow/pull/12751#issuecomment-1085009789


   That's a good point...`rbind()` doesn't do that for any type of object. But it is *really* useful and difficult to replicate if there's a good way to do it here!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] wjones127 commented on a change in pull request #12751: ARROW-15989: rbind & cbind for Table & RecordBatch

Posted by GitBox <gi...@apache.org>.
wjones127 commented on a change in pull request #12751:
URL: https://github.com/apache/arrow/pull/12751#discussion_r837979911



##########
File path: r/R/record-batch.R
##########
@@ -189,3 +189,36 @@ record_batch <- RecordBatch$create
 
 #' @export
 names.RecordBatch <- function(x) x$names()
+
+#' @export
+rbind.RecordBatch <- function(...) {
+  stop("Use Table$create to combine record batches")
+}
+
+#' @export
+cbind.RecordBatch <- function(...) {
+  batches <- list(...)
+
+  # Assert they have the same length
+  unequal_length_idx <- which.min(lapply(batches, function(x) x$num_rows == batches[[1]]$num_rows))
+  if (unequal_length_idx != 1) {
+    stop(
+      sprintf(
+        "Cannot cbind batches with unequal number of rows. Batch 1 has %i, batch %i has %i",
+        batches[[1]]$num_rows,
+        unequal_length_idx,
+        batches[[unequal_length_idx]]$num_rows
+      )
+    )
+  }
+
+  fields <- unlist(lapply(batches, function(tab) tab$schema$fields))
+  schema <- Schema$create(fields)
+  columns <- unlist(lapply(batches, function(tab) tab$columns))
+
+  # return new table
+  args <- columns
+  names(args) <- names(schema)
+  args$schema <- schema
+  do.call(RecordBatch$create, args)

Review comment:
       This feels a little silly to me, so would welcome suggestions for a better way to write this :)




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] wjones127 commented on pull request #12751: ARROW-15989: [R] rbind & cbind for Table & RecordBatch

Posted by GitBox <gi...@apache.org>.
wjones127 commented on pull request #12751:
URL: https://github.com/apache/arrow/pull/12751#issuecomment-1083610036


   cc @jonkeane @paleolimbot 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] wjones127 commented on pull request #12751: ARROW-15989: [R] rbind & cbind for Table & RecordBatch

Posted by GitBox <gi...@apache.org>.
wjones127 commented on pull request #12751:
URL: https://github.com/apache/arrow/pull/12751#issuecomment-1086171120


   > That is wild! A way around that would be to define our own arrow_rbind() and arrow_cbind() and error for built-in rbind() and cbind() since it's harder to predict exactly which method will get called.
   
   That's a good point. I think the R 4.0+ behavior works well, and not sure how much longer we are supporting 3.6, so what I've done for now is just modify the tests to not enforce that particular case in R < 4.0.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] paleolimbot commented on a change in pull request #12751: ARROW-15989: [R] rbind & cbind for Table & RecordBatch

Posted by GitBox <gi...@apache.org>.
paleolimbot commented on a change in pull request #12751:
URL: https://github.com/apache/arrow/pull/12751#discussion_r839944534



##########
File path: r/R/table.R
##########
@@ -149,6 +149,64 @@ Table$create <- function(..., schema = NULL) {
 #' @export
 names.Table <- function(x) x$ColumnNames()
 
+#' @export
+rbind.Table <- function(...) {
+  tables <- list(...)
+
+  # assert they have same schema
+  schema <- tables[[1]]$schema
+  unequal_schema_idx <- which.min(lapply(tables, function(x) x$schema == schema))
+  if (unequal_schema_idx != 1) {
+    stop(paste0(
+      sprintf("Schema at index %i does not match the first schema\n", unequal_schema_idx),
+      "Schema 1:\n",
+      tables[[1]]$schema$ToString(),
+      sprintf("\nSchema %i:\n", unequal_schema_idx),
+      tables[[unequal_schema_idx]]$schema$ToString()
+    ))
+  }
+
+  # create chunked array from each column
+  columns <- vector(mode = "list", length = tables[[1]]$num_columns)
+  for (i in seq_len(length(columns))) {
+    columns[[i]] <- do.call(c, lapply(tables, function(tab) tab[[i]]))
+  }
+
+  # return new table
+  args <- columns
+  names(args) <- names(schema)
+  args$schema <- schema
+  do.call(Table$create, args)
+}
+
+#' @export
+cbind.Table <- function(...) {
+  tables <- list(...)
+
+  # Assert they have the same length
+  unequal_length_idx <- which.min(lapply(tables, function(x) x$num_rows == tables[[1]]$num_rows))

Review comment:
       No problem. You've copied that error code in two places, so maybe wrap it up into a function?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] paleolimbot commented on a change in pull request #12751: ARROW-15989: [R] rbind & cbind for Table & RecordBatch

Posted by GitBox <gi...@apache.org>.
paleolimbot commented on a change in pull request #12751:
URL: https://github.com/apache/arrow/pull/12751#discussion_r839534767



##########
File path: r/R/chunked-array.R
##########
@@ -148,6 +153,12 @@ ChunkedArray$create <- function(..., type = NULL) {
   ChunkedArray__from_list(list2(...), type)
 }
 
+#' @export
+c.ChunkedArray <- function(...) {
+  arrays <- list(...)
+  do.call(ChunkedArray$create, unlist(lapply(arrays, function(arr) arr$chunks)))

Review comment:
       Also, it's possible for somebody to do `c(this_is_a_chunked_array, 1:10, c("a", "b", "c))` (so you might want to do `chunked_arrays <- lapply(chunked_arrays, ChunkedArray$create)`).




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] wjones127 commented on pull request #12751: ARROW-15989: [R] rbind & cbind for Table & RecordBatch

Posted by GitBox <gi...@apache.org>.
wjones127 commented on pull request #12751:
URL: https://github.com/apache/arrow/pull/12751#issuecomment-1086153469


   The reason for the odd dispatch is that R 3.6 looks for a `cbind` implementation for each argument, and if it finds multiple then it falls back to the built-in `cbind`. Our `cbind.RecordBatch` test only has one input with a cbind method (the record batch), but our `cbind.Table` test has two (record batch + table).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] paleolimbot commented on a change in pull request #12751: ARROW-15989: [R] rbind & cbind for Table & RecordBatch

Posted by GitBox <gi...@apache.org>.
paleolimbot commented on a change in pull request #12751:
URL: https://github.com/apache/arrow/pull/12751#discussion_r839090211



##########
File path: r/R/record-batch.R
##########
@@ -189,3 +189,36 @@ record_batch <- RecordBatch$create
 
 #' @export
 names.RecordBatch <- function(x) x$names()
+
+#' @export
+rbind.RecordBatch <- function(...) {
+  stop("Use Table$create to combine record batches")
+}
+
+#' @export
+cbind.RecordBatch <- function(...) {
+  batches <- list(...)
+
+  # Assert they have the same length
+  unequal_length_idx <- which.min(lapply(batches, function(x) x$num_rows == batches[[1]]$num_rows))
+  if (unequal_length_idx != 1) {
+    stop(
+      sprintf(
+        "Cannot cbind batches with unequal number of rows. Batch 1 has %i, batch %i has %i",
+        batches[[1]]$num_rows,
+        unequal_length_idx,
+        batches[[unequal_length_idx]]$num_rows
+      )
+    )
+  }
+
+  fields <- unlist(lapply(batches, function(tab) tab$schema$fields))
+  schema <- Schema$create(fields)
+  columns <- unlist(lapply(batches, function(tab) tab$columns))
+
+  # return new table
+  args <- columns
+  names(args) <- names(schema)
+  args$schema <- schema
+  do.call(RecordBatch$create, args)

Review comment:
       You should be able to do `RecordBatch$create(!!! args, schema = schema)` (because we use `rlang::list2()` to ingest `...`, which allows 'splicing' via `!!!`).




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] paleolimbot commented on a change in pull request #12751: ARROW-15989: [R] rbind & cbind for Table & RecordBatch

Posted by GitBox <gi...@apache.org>.
paleolimbot commented on a change in pull request #12751:
URL: https://github.com/apache/arrow/pull/12751#discussion_r839924183



##########
File path: r/R/table.R
##########
@@ -149,6 +149,64 @@ Table$create <- function(..., schema = NULL) {
 #' @export
 names.Table <- function(x) x$ColumnNames()
 
+#' @export
+rbind.Table <- function(...) {
+  tables <- list(...)
+
+  # assert they have same schema
+  schema <- tables[[1]]$schema
+  unequal_schema_idx <- which.min(lapply(tables, function(x) x$schema == schema))
+  if (unequal_schema_idx != 1) {
+    stop(paste0(
+      sprintf("Schema at index %i does not match the first schema\n", unequal_schema_idx),
+      "Schema 1:\n",
+      tables[[1]]$schema$ToString(),
+      sprintf("\nSchema %i:\n", unequal_schema_idx),
+      tables[[unequal_schema_idx]]$schema$ToString()
+    ))
+  }
+
+  # create chunked array from each column
+  columns <- vector(mode = "list", length = tables[[1]]$num_columns)
+  for (i in seq_len(length(columns))) {
+    columns[[i]] <- do.call(c, lapply(tables, function(tab) tab[[i]]))
+  }
+
+  # return new table
+  args <- columns
+  names(args) <- names(schema)
+  args$schema <- schema
+  do.call(Table$create, args)
+}
+
+#' @export
+cbind.Table <- function(...) {
+  tables <- list(...)
+
+  # Assert they have the same length
+  unequal_length_idx <- which.min(lapply(tables, function(x) x$num_rows == tables[[1]]$num_rows))

Review comment:
       That output is correct (`tables` is a `list()` of size 2, and you've provided it with one argument)...you'd need to do:
   
   ``` r
   library(arrow, warn.conflicts = FALSE)
   tables <- list(Table$create(a = 1:10), Table$create(a = c("a", "b")))
   size_proxy <- lapply(tables, function(tbl) seq_len(tbl$num_rows))
   vctrs::vec_size_common(!!! size_proxy)
   #> Error in `stop_vctrs()`:
   #> ! Can't recycle `..1` (size 10) to match `..2` (size 2).
   ```
   
   `seq_len()` is good here because it doesn't actually allocate if the values are never accessed:
   
   ``` r
   x <- seq_len(1e9)
   lobstr::obj_size(x)
   #> 680 B
   vctrs::vec_size_common(x)
   #> [1] 1000000000
   lobstr::obj_size(x)
   #> 680 B
   x[1] <- 27
   lobstr::obj_size(x)
   #> 8,000,000,048 B
   ```
   
   You can still totally roll your own error if this gives a confusing one for the final behaviour!




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] wjones127 commented on a change in pull request #12751: ARROW-15989: [R] rbind & cbind for Table & RecordBatch

Posted by GitBox <gi...@apache.org>.
wjones127 commented on a change in pull request #12751:
URL: https://github.com/apache/arrow/pull/12751#discussion_r839996652



##########
File path: r/R/table.R
##########
@@ -149,6 +149,64 @@ Table$create <- function(..., schema = NULL) {
 #' @export
 names.Table <- function(x) x$ColumnNames()
 
+#' @export
+rbind.Table <- function(...) {
+  tables <- list(...)
+
+  # assert they have same schema
+  schema <- tables[[1]]$schema

Review comment:
       Yes, but see: https://issues.apache.org/jira/browse/ARROW-16085




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] wjones127 commented on a change in pull request #12751: ARROW-15989: [R] rbind & cbind for Table & RecordBatch

Posted by GitBox <gi...@apache.org>.
wjones127 commented on a change in pull request #12751:
URL: https://github.com/apache/arrow/pull/12751#discussion_r839803582



##########
File path: r/R/chunked-array.R
##########
@@ -148,6 +153,12 @@ ChunkedArray$create <- function(..., type = NULL) {
   ChunkedArray__from_list(list2(...), type)
 }
 
+#' @export
+c.ChunkedArray <- function(...) {
+  arrays <- list(...)
+  do.call(ChunkedArray$create, unlist(lapply(arrays, function(arr) arr$chunks)))

Review comment:
       So if I modify `ChunkedArray$create()` to accept a chunked array as an argument, it basically becomes it's own `c()` implementation. It also already has a type parameter the handles coercing to a common type, so I think that's handled.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] paleolimbot commented on pull request #12751: ARROW-15989: [R] rbind & cbind for Table & RecordBatch

Posted by GitBox <gi...@apache.org>.
paleolimbot commented on pull request #12751:
URL: https://github.com/apache/arrow/pull/12751#issuecomment-1086165854


   That is wild! A way around that would be to define our own `arrow_rbind()` and `arrow_cbind()` and error for built-in `rbind()` and `cbind()` since it's harder to predict exactly which method will get called.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] wjones127 commented on pull request #12751: ARROW-15989: [R] rbind & cbind for Table & RecordBatch

Posted by GitBox <gi...@apache.org>.
wjones127 commented on pull request #12751:
URL: https://github.com/apache/arrow/pull/12751#issuecomment-1084878383


   @paleolimbot For the column filling, should that be the behavior of rbind, or an alternative row binding function? For example, I think tibble has separated behavior for each:
   
   ```r
   > library(dplyr)
   > rbind(tibble(x = 1:2), tibble(y = 2:4))
   Error in match.names(clabs, names(xi)) : 
     names do not match previous names
   > bind_rows(tibble(x = 1:2), tibble(y = 2:4))
   # A tibble: 5 × 2
         x     y
     <int> <int>
   1     1    NA
   2     2    NA
   3    NA     2
   4    NA     3
   5    NA     4
   ```
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] wjones127 commented on pull request #12751: ARROW-15989: [R] rbind & cbind for Table & RecordBatch

Posted by GitBox <gi...@apache.org>.
wjones127 commented on pull request #12751:
URL: https://github.com/apache/arrow/pull/12751#issuecomment-1085189273


   Got the `cbind` implementation to work with more types, so it's now more analogous to base `cbind`.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org