Posted to jira@arrow.apache.org by "Dewey Dunnington (Jira)" <ji...@apache.org> on 2021/11/03 12:39:00 UTC

[jira] [Comment Edited] (ARROW-14071) [R] Try to arrow_eval user-defined functions

    [ https://issues.apache.org/jira/browse/ARROW-14071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17438024#comment-17438024 ] 

Dewey Dunnington edited comment on ARROW-14071 at 11/3/21, 12:38 PM:
---------------------------------------------------------------------

Reprex: 

 
{code:java}
library(arrow, warn.conflicts = FALSE)
library(dplyr, warn.conflicts = FALSE)

nchar2 <- function(x) {
  nchar(x)
}
RecordBatch$create(my_string = "1234") %>%
  mutate(nchar(my_string), nchar2(my_string)) %>%
  collect()
#> Warning: Expression nchar2(my_string) not supported in Arrow; pulling data into
#> R
#> # A tibble: 1 × 3
#>   my_string `nchar(my_string)` `nchar2(my_string)`
#>   <chr>                  <int>               <int>
#> 1 1234                       4                   4
{code}
I'm not sure whether this works with the rlang data mask, but you could do this by setting `environment(fun)` to a new environment that inherits from the original `environment(fun)`. You probably don't want the data mask anyway, because you don't want field references to interfere with the function's internal variable names. (With apologies if you've done this already and I missed it):
{noformat}
masked_function <- function(fun, env) {
  # probably want to (shallow) copy `env` because we'd need to modify it
  # and it's passed by reference
  env2 <- new.env(parent = environment(fun))
  for (name in names(env)) {
    env2[[name]] <- env[[name]]
  }
  
  environment(fun) <- env2
  fun
}

some_var <- 45
my_function <- function() {
  some_var + 5
}

my_function()
#> [1] 50
masked_function(my_function, as.environment(list(some_var = 1)))()
#> [1] 6
{noformat}
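
To tie this back to the reprex, here is a minimal base R sketch (not the arrow package's implementation) of how `masked_function()` would let a user-defined function such as `nchar2()` pick up a replacement `nchar` binding. `fake_nse_funcs` and the `via` attribute are hypothetical stand-ins invented for illustration; they are not arrow's actual `nse_funcs`.

```r
# Hypothetical stand-in for a set of translated functions: this `nchar`
# tags its result so we can see which version was called.
fake_nse_funcs <- list(
  nchar = function(x) structure(base::nchar(x), via = "override")
)

# Same helper as above, repeated so this block runs on its own.
masked_function <- function(fun, env) {
  env2 <- new.env(parent = environment(fun))
  for (name in names(env)) {
    env2[[name]] <- env[[name]]
  }
  environment(fun) <- env2
  fun
}

nchar2 <- function(x) {
  nchar(x)
}

# Unmodified: `nchar` inside nchar2() resolves lexically to base::nchar
plain <- nchar2("1234")
attr(plain, "via")
#> NULL

# Masked: `nchar` is found in the injected environment first
masked <- masked_function(nchar2, as.environment(fake_nse_funcs))("1234")
attr(masked, "via")
#> [1] "override"

# A binding named `x` in the injected environment does not shadow the
# function's own argument `x`
shadow_check <- masked_function(
  function(x) x + 1,
  as.environment(list(x = 100))
)(1)
shadow_check
#> [1] 2
```

The injected bindings sit between the function's own frame and its original enclosure, so arguments and locals still take precedence over anything injected. Note that this only rewires `fun` itself: any helper functions that `fun` calls keep their own environments.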


was (Author: paleolimbot):
Reprex: 

{noformat}
library(arrow, warn.conflicts = FALSE)
library(dplyr, warn.conflicts = FALSE)

nchar2 <- function(x) {
  nchar(x)
}
RecordBatch$create(my_string = "1234") %>%
  mutate(nchar(my_string), nchar2(my_string)) %>%
  collect()
#> Warning: Expression nchar2(my_string) not supported in Arrow; pulling data into
#> R
#> # A tibble: 1 × 3
#>   my_string `nchar(my_string)` `nchar2(my_string)`
#>   <chr>                  <int>               <int>
#> 1 1234                       4
{noformat}

 

I'm not sure if this works with the rlang data mask, but you could do this by setting `environment(fun)` to an environment that inherits the original `environment(fun)`. You probably don't want the data mask anyway because you don't want field references to interfere with the internal function variable names. (With apologies if you've done this already and I missed it):

{noformat}
masked_function <- function(fun, env) {
  # probably want to (shallow) copy `env` because we'd need to modify it
  # and it's passed by reference
  env2 <- new.env(parent = environment(fun))
  for (name in names(env)) {
    env2[[name]] <- env[[name]]
  }

  environment(fun) <- env2
  fun
}

some_var <- 45
my_function <- function() {
  some_var + 5
}

my_function()
#> [1] 50
masked_function(my_function, as.environment(list(some_var = 1)))()
#> [1] 6
{noformat}

> [R] Try to arrow_eval user-defined functions
> --------------------------------------------
>
>                 Key: ARROW-14071
>                 URL: https://issues.apache.org/jira/browse/ARROW-14071
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: R
>            Reporter: Neal Richardson
>            Assignee: Dewey Dunnington
>            Priority: Major
>             Fix For: 7.0.0
>
>
> The first test passes but the second one fails, even though they're equivalent. The user's function isn't being evaluated in the nse_funcs environment.
> {code}
>   expect_dplyr_equal(
>     input %>%
>       select(-fct) %>%
>       filter(nchar(padded_strings) < 10) %>%
>       collect(),
>     tbl
>   )
>   isShortString <- function(x) nchar(x) < 10
>   expect_dplyr_equal(
>     input %>%
>       select(-fct) %>%
>       filter(isShortString(padded_strings)) %>%
>       collect(),
>     tbl
>   )
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)