You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/10/11 13:39:06 UTC
[GitHub] [arrow] nealrichardson commented on a diff in pull request #14371: ARROW-12105: [R] Replace vars_select, vars_rename with eval_select, eval_rename

nealrichardson commented on code in PR #14371:
URL: https://github.com/apache/arrow/pull/14371#discussion_r992302239


##########
r/R/arrow-package.R:
##########
@@ -27,7 +27,7 @@
 #' @importFrom rlang is_list call2 is_empty as_function as_label arg_match is_symbol is_call call_args
 #' @importFrom rlang quo_set_env quo_get_env is_formula quo_is_call f_rhs parse_expr f_env new_quosure
 #' @importFrom rlang new_quosures expr_text
-#' @importFrom tidyselect vars_pull vars_rename vars_select eval_select
+#' @importFrom tidyselect vars_pull vars_select eval_select eval_rename

Review Comment:
   Do we need to update the other usages of vars_select in the package too?



##########
r/R/util.R:
##########
@@ -251,3 +251,25 @@ augment_io_error_msg <- function(e, call, schema = NULL, format = NULL) {
   handle_augmented_field_misuse(msg, call)
   abort(msg, call = call)
 }
+
+simulate_data_frame <- function(schema) {
+

Review Comment:
   Add a comment explaining why we need this function.
   
   And do we need to export this? I think I saw some discussion between @paleolimbot and @krlmlr about needing this function in DBI or adbc.



##########
r/R/util.R:
##########
@@ -251,3 +251,25 @@ augment_io_error_msg <- function(e, call, schema = NULL, format = NULL) {
   handle_augmented_field_misuse(msg, call)
   abort(msg, call = call)
 }
+
+simulate_data_frame <- function(schema) {

Review Comment:
   Since this is going to be called every time someone does `select/rename/relocate`, I'd like for this function to be cheaper. In other PRs I've been noticing the overhead of creating R6 objects, which generally is not terrible (~150 microseconds on my machine) but it adds up. And here, we're creating lots of objects we're throwing away: for each column, we create a Field, then a DataType from that, then in concat_arrays, we create a null DataType, an Array with that, and a new Array that is cast to the correct DataType. That adds up to around 1ms per column, every time this function is called. That's enough to get noticed.
   
   Can we move this to C++? Should be a simple enough switch statement to map Arrow type ids to the corresponding R length-0 vector.
   
   



##########
r/R/dplyr-select.R:
##########
@@ -115,18 +90,39 @@ relocate.arrow_dplyr_query <- function(.data, ..., .before = NULL, .after = NULL
 }
 relocate.Dataset <- relocate.ArrowTabular <- relocate.RecordBatchReader <- relocate.arrow_dplyr_query
 
-check_select_helpers <- function(exprs) {
-  # Throw an error if unsupported tidyselect selection helpers in `exprs`
-  exprs <- lapply(exprs, function(x) if (is_quosure(x)) quo_get_expr(x) else x)
-  unsup_select_helpers <- "where"
-  funs_in_exprs <- unlist(lapply(exprs, all_funs))
-  unsup_funs <- funs_in_exprs[funs_in_exprs %in% unsup_select_helpers]
-  if (length(unsup_funs)) {
-    stop(
-      "Unsupported selection ",
-      ngettext(length(unsup_funs), "helper: ", "helpers: "),
-      oxford_paste(paste0(unsup_funs, "()"), quote = FALSE),
-      call. = FALSE
-    )
+column_select <- function(.data, select_expression, op = c("select", "rename")) {
+  op <- match.arg(op)
+
+  .data <- as_adq(.data)
+  sim_df <- simulate_data_frame(implicit_schema(.data))
+  old_names <- names(sim_df)
+
+  if (op == "select") {
+    out <- eval_select(expr(c(!!!select_expression)), sim_df)
+    # select only columns from `out`
+    subset <- out
+  } else if (op == "rename") {
+    out <- eval_rename(expr(c(!!!select_expression)), sim_df)
+    # select all columns as only renaming
+    subset <- set_names(seq_along(old_names), old_names)
+    names(subset)[out] <- names(out)
+  }
+
+  .data$selected_columns <- set_names(.data$selected_columns[subset], names(subset))
+
+  # check if names have updated
+  new_names <- old_names
+  new_names[out] <- names(out)
+  names_compared <- set_names(old_names, new_names)
+  renamed <- names_compared[names(names_compared) != names_compared]

Review Comment:
   Same thing but I think the intent is clearer:
   
   ```suggestion
     renamed <- names_compared[old_names != new_names]
   ```



##########
r/data-raw/docgen.R:
##########
@@ -127,10 +127,7 @@ docs <- arrow:::.cache$docs
 # Add some functions
 
 # across() is handled by manipulating the quosures, not by nse_funcs
-docs[["dplyr::across"]] <- c(
-  # TODO(ARROW-17384): implement where
-  "Use of `where()` selection helper not yet supported"
-)
+docs[["dplyr::across"]] <- character(0)

Review Comment:
   🎉 



##########
r/R/util.R:
##########
@@ -251,3 +251,25 @@ augment_io_error_msg <- function(e, call, schema = NULL, format = NULL) {
   handle_augmented_field_misuse(msg, call)
   abort(msg, call = call)
 }
+
+simulate_data_frame <- function(schema) {
+
+  arrays <- lapply(
+    schema$fields,
+    function(field) tryCatch(
+      concat_arrays(type = field$type),
+      error = function(...) concat_arrays(type = NULL)

Review Comment:
   Why would this error ever occur?



##########
r/R/util.R:
##########
@@ -251,3 +251,25 @@ augment_io_error_msg <- function(e, call, schema = NULL, format = NULL) {
   handle_augmented_field_misuse(msg, call)
   abort(msg, call = call)
 }
+
+simulate_data_frame <- function(schema) {
+
+  arrays <- lapply(
+    schema$fields,
+    function(field) tryCatch(
+      concat_arrays(type = field$type),
+      error = function(...) concat_arrays(type = NULL)
+    )
+  )
+
+  vectors <- lapply(
+    arrays,
+    function(array) tryCatch(
+      as.vector(array),
+      error = function(...) vctrs::unspecified()
+    )
+  )
+
+  names(vectors) <- names(schema)
+  tibble::new_tibble(vectors, nrow = 0)

Review Comment:
   ```suggestion
     vctrs::new_data_frame(vectors, nrow = 0)
   ```
   
   and we should import that not :: it



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org