Posted to jira@arrow.apache.org by "Neal Richardson (Jira)" <ji...@apache.org> on 2021/06/25 17:18:00 UTC
[jira] [Commented] (ARROW-13186) [R] Implement type determination more cleanly
[ https://issues.apache.org/jira/browse/ARROW-13186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17369591#comment-17369591 ]
Neal Richardson commented on ARROW-13186:
-----------------------------------------
I did some experimenting and got something that works for the arrow_mask/arrow_eval code paths. However, any paths that use tidyselect::eval_select (currently only relocate, but presumably others will be added) need slightly different handling, and I haven't had the chance to work out a solution there yet.
The idea is that we stick the schema in the data mask as a "data pronoun"-like object, so that any function called inside arrow_eval() can walk up the call stack and find it.
{code}
diff --git a/r/R/dplyr-eval.R b/r/R/dplyr-eval.R
index de68d2f2c..eda40dc23 100644
--- a/r/R/dplyr-eval.R
+++ b/r/R/dplyr-eval.R
@@ -86,9 +86,6 @@ arrow_mask <- function(.data) {
     f_env[[f]] <- fail
   }
-  # Assign the schema to the expressions
-  map(.data$selected_columns, ~(.$schema <- .data$.data$schema))
-
   # Add the column references and make the mask
   out <- new_data_mask(
     new_environment(.data$selected_columns, parent = f_env),
@@ -98,5 +95,18 @@ arrow_mask <- function(.data) {
   # TODO: figure out what rlang::as_data_pronoun does/why we should use it
   # (because if we do we get `Error: Can't modify the data pronoun` in mutate())
   out$.data <- .data$selected_columns
+  out$.schema <- .data$.data$schema
   out
 }
+
+arrow_eval_schema <- function() {
+  n <- 1
+  env <- parent.frame(n)
+  while(!identical(env, .GlobalEnv)) {
+    if (".schema" %in% ls(env, all.names = TRUE)) {
+      return(get(".schema", env))
+    }
+    n <- n + 1
+    env <- parent.frame(n)
+  }
+}
{code}
Then each of the is* functions calls arrow_eval_schema() to get it.
The benefit of something like this is that we avoid the cost of tracking/merging schemas when building expressions and only have to grab the schema when we need it (which is rarely, since none of the other compute kernels require it).
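For illustration, here is a minimal base-R sketch of the same lookup pattern outside of arrow. It is only a sketch under stated assumptions, not the package's implementation: find_schema(), is_integer_col(), and eval_with_schema() are hypothetical names, a named character vector stands in for a real Schema, and plain eval() with an environment stands in for the rlang data mask.

```r
# Walk up the calling frames until one contains `.schema` (hypothetical
# analogue of arrow_eval_schema() in the diff above). Returns NULL if no
# frame between the caller and the global environment defines it.
find_schema <- function() {
  n <- 1
  env <- parent.frame(n)
  while (!identical(env, globalenv())) {
    if (exists(".schema", envir = env, inherits = FALSE)) {
      return(get(".schema", envir = env, inherits = FALSE))
    }
    n <- n + 1
    env <- parent.frame(n)
  }
  NULL
}

# Hypothetical type-check helper: it receives only a column name and pulls
# the schema from whichever evaluation context it is running in.
is_integer_col <- function(name) {
  schema <- find_schema()
  identical(schema[[name]], "int32")
}

# Hypothetical stand-in for arrow_mask()/arrow_eval(): the helper functions
# live in a parent environment, and the schema is stored in the mask itself.
eval_with_schema <- function(expr, schema, funcs) {
  f_env <- list2env(funcs, parent = baseenv())
  mask <- new.env(parent = f_env)
  mask$.schema <- schema
  eval(expr, mask)
}

schema <- c(x = "int32", y = "string")
funcs <- list(is_integer_col = is_integer_col)
eval_with_schema(quote(is_integer_col("x")), schema, funcs)
eval_with_schema(quote(is_integer_col("y")), schema, funcs)
```

The expressions themselves carry no schema here; only the helper that needs type information goes looking for it, which is the trade-off described above.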
> [R] Implement type determination more cleanly
> ---------------------------------------------
>
> Key: ARROW-13186
> URL: https://issues.apache.org/jira/browse/ARROW-13186
> Project: Apache Arrow
> Issue Type: Improvement
> Components: R
> Affects Versions: 5.0.0
> Reporter: Ian Cook
> Priority: Major
>
> In the R package, there are several improvements in data type determination in the 5.0.0 release. The implementation of these improvements used a kludge: They made it possible to store a {{Schema}} in an {{Expression}} object in the R package; when set, this {{Schema}} is retained in derivative {{Expression}} objects. This was the most convenient way to make the {{Schema}} available for passing it to the {{type_id()}} method, which requires it. But this introduces a deviation of the R package's {{Expression}} object from the C++ library's {{Expression}} object, and it makes our type determination functions work differently than the other R functions in {{nse_funcs}}.
> The Jira issues in which these somewhat kludgy improvements were made are:
> * allowing a schema to be stored in the {{Expression}} object, and implementing type determination functions in a way that uses that schema (ARROW-12781)
> * retaining a schema in derivative {{Expression}} objects (ARROW-13117)
> * setting an empty schema in scalar literal {{Expression}} objects (ARROW-13119)
> From the perspective of the R package, an ideal way to implement type determination functions would be to call a {{type_id}} kernel through the {{call_function}} interface, but this was rejected in ARROW-13167. Consider other ways that we might improve this implementation.