Posted to jira@arrow.apache.org by "Neal Richardson (Jira)" <ji...@apache.org> on 2021/06/25 17:18:00 UTC

[jira] [Commented] (ARROW-13186) [R] Implement type determination more cleanly

    [ https://issues.apache.org/jira/browse/ARROW-13186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17369591#comment-17369591 ] 

Neal Richardson commented on ARROW-13186:
-----------------------------------------

I did some experimenting and got something that works for the arrow_mask/arrow_eval code paths. Any paths that use tidyselect::eval_select (currently only relocate, but presumably others will be added) need slightly different handling, and I haven't had the chance to work out a solution there yet.

The idea is that we stick the schema into the data mask as a "data pronoun"-like object, so that any function called inside arrow_eval() can walk up the calling frames and find it.

{code}
diff --git a/r/R/dplyr-eval.R b/r/R/dplyr-eval.R
index de68d2f2c..eda40dc23 100644
--- a/r/R/dplyr-eval.R
+++ b/r/R/dplyr-eval.R
@@ -86,9 +86,6 @@ arrow_mask <- function(.data) {
     f_env[[f]] <- fail
   }
 
-  # Assign the schema to the expressions
-  map(.data$selected_columns, ~(.$schema <- .data$.data$schema))
-
   # Add the column references and make the mask
   out <- new_data_mask(
     new_environment(.data$selected_columns, parent = f_env),
@@ -98,5 +95,18 @@ arrow_mask <- function(.data) {
   # TODO: figure out what rlang::as_data_pronoun does/why we should use it
   # (because if we do we get `Error: Can't modify the data pronoun` in mutate())
   out$.data <- .data$selected_columns
+  out$.schema <- .data$.data$schema
   out
 }
+
+arrow_eval_schema <- function() {
+  n <- 1
+  env <- parent.frame(n)
+  while(!identical(env, .GlobalEnv)) {
+    if (".schema" %in% ls(env, all.names = TRUE)) {
+      return(get(".schema", env))
+    }
+    n <- n + 1
+    env <- parent.frame(n)
+  }
+}
{code}

Then each of the is* functions calls arrow_eval_schema() to get it. 
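
To illustrate the lookup, here is a minimal base-R sketch rather than the actual arrow code. It assumes a plain environment standing in for the data mask and a named list standing in for the Schema. {{is_integer_col()}} is a hypothetical helper, not one of the real is* bindings, which operate on Expression objects. The lookup uses {{exists(..., inherits = FALSE)}} in place of the {{ls()}} check from the diff, which should behave the same for environments, and returns NULL explicitly when no schema is found.

```r
# Hypothetical stand-in for a Schema: a named list of column types.
schema <- list(x = "int32", y = "string")

# Walk up the calling frames until one of them holds a `.schema` binding.
# parent.frame(n) falls back to the global environment once n exceeds the
# depth of the call stack, which ends the loop.
arrow_eval_schema <- function() {
  n <- 1
  env <- parent.frame(n)
  while (!identical(env, .GlobalEnv)) {
    if (exists(".schema", envir = env, inherits = FALSE)) {
      return(get(".schema", envir = env))
    }
    n <- n + 1
    env <- parent.frame(n)
  }
  NULL
}

# Hypothetical type-determination helper: it never receives the schema as
# an argument and instead finds it through the frame it was called from.
is_integer_col <- function(name) {
  sch <- arrow_eval_schema()
  identical(sch[[name]], "int32")
}

# Stand-in for the data mask: an environment carrying `.schema`.
mask <- new.env(parent = environment())
mask$.schema <- schema

# Evaluating inside the mask makes the mask the calling frame of the helper.
res_x <- eval(quote(is_integer_col("x")), mask)
res_y <- eval(quote(is_integer_col("y")), mask)
```

In this sketch the helper finds the schema only when it is evaluated in the mask; called from anywhere else, {{arrow_eval_schema()}} returns NULL and the helper reports FALSE.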

The benefit of something like this is that we avoid the cost of tracking/merging schemas when building expressions and only have to grab the schema when we need it (which is rarely, since none of the other compute kernels require it).

> [R] Implement type determination more cleanly
> ---------------------------------------------
>
>                 Key: ARROW-13186
>                 URL: https://issues.apache.org/jira/browse/ARROW-13186
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: R
>    Affects Versions: 5.0.0
>            Reporter: Ian Cook
>            Priority: Major
>
> In the R package, there are several improvements in data type determination in the 5.0.0 release. The implementation of these improvements used a kludge: They made it possible to store a {{Schema}} in an {{Expression}} object in the R package; when set, this {{Schema}} is retained in derivative {{Expression}} objects. This was the most convenient way to make the {{Schema}} available for passing it to the {{type_id()}} method, which requires it. But this introduces a deviation of the R package's {{Expression}} object from the C++ library's {{Expression}} object, and it makes our type determination functions work differently than the other R functions in {{nse_funcs}}.
> The Jira issues in which these somewhat kludgy improvements were made are:
>  * allowing a schema to be stored in the {{Expression}} object, and implementing type determination functions in a way that uses that schema (ARROW-12781)
>  * retaining a schema in derivative {{Expression}} objects (ARROW-13117)
>  * setting an empty schema in scalar literal {{Expression}} objects (ARROW-13119)
> From the perspective of the R package, an ideal way to implement type determination functions would be to call a {{type_id}} kernel through the {{call_function}} interface, but this was rejected in ARROW-13167. Consider other ways that we might improve this implementation.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)