Posted to jira@arrow.apache.org by "Dewey Dunnington (Jira)" <ji...@apache.org> on 2021/11/03 12:39:00 UTC
[jira] [Comment Edited] (ARROW-14071) [R] Try to arrow_eval user-defined functions
[ https://issues.apache.org/jira/browse/ARROW-14071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17438024#comment-17438024 ]
Dewey Dunnington edited comment on ARROW-14071 at 11/3/21, 12:38 PM:
---------------------------------------------------------------------
Reprex:
{code:java}
library(arrow, warn.conflicts = FALSE)
library(dplyr, warn.conflicts = FALSE)

nchar2 <- function(x) {
  nchar(x)
}

RecordBatch$create(my_string = "1234") %>%
  mutate(nchar(my_string), nchar2(my_string)) %>%
  collect()
#> Warning: Expression nchar2(my_string) not supported in Arrow; pulling data into
#> R
#> # A tibble: 1 × 3
#>   my_string `nchar(my_string)` `nchar2(my_string)`
#>   <chr>                  <int>               <int>
#> 1 1234                       4                   4
{code}
I'm not sure whether this works with the rlang data mask, but you could do this by setting `environment(fun)` to a new environment that inherits from the original `environment(fun)`. You probably don't want the data mask anyway, because you don't want field references to interfere with the function's internal variable names. (With apologies if you've done this already and I missed it):
{noformat}
masked_function <- function(fun, env) {
  # probably want to (shallow) copy `env` because we'd need to modify it
  # and it's passed by reference
  env2 <- new.env(parent = environment(fun))
  for (name in names(env)) {
    env2[[name]] <- env[[name]]
  }
  environment(fun) <- env2
  fun
}

some_var <- 45
my_function <- function() {
  some_var + 5
}

my_function()
#> [1] 50

masked_function(my_function, as.environment(list(some_var = 1)))()
#> [1] 6
{noformat}
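Three properties of this approach seem worth checking, sketched here in base R only. The names `offset`, `scaled`, and `mask` are made up for illustration, and none of this is code from the arrow package. First, `environment<-` acts on a copy, so the caller's original function should be left untouched. Second, names that are not in the mask should still resolve through the function's original enclosing environment. Third, arguments and locals inside the function body should take precedence over masked names, which is the "no interference" property mentioned above.

```r
# Same helper as in the comment, repeated so this block runs on its own
masked_function <- function(fun, env) {
  env2 <- new.env(parent = environment(fun))
  for (name in names(env)) {
    env2[[name]] <- env[[name]]
  }
  environment(fun) <- env2
  fun
}

offset <- 100              # defined outside the mask (hypothetical name)
scaled <- function(x) {
  x <- x * 2               # argument/local `x` shadows any `x` in the mask
  x + offset               # `offset` is found via the original environment
}

# Hypothetical mask that happens to contain an `x` entry
mask <- as.environment(list(x = -1, some_var = 1))

original_env <- environment(scaled)
scaled_masked <- masked_function(scaled, mask)

result <- scaled_masked(5)  # (5 * 2) + 100
unchanged <- identical(environment(scaled), original_env)
```

Here `result` is 110 and `unchanged` is `TRUE`: the mask's `x = -1` never leaks into the function body, and the original `scaled` keeps its environment.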
was (Author: paleolimbot):
Reprex:
{noformat}
library(arrow, warn.conflicts = FALSE)
library(dplyr, warn.conflicts = FALSE)

nchar2 <- function(x) {
  nchar(x)
}

RecordBatch$create(my_string = "1234") %>%
  mutate(nchar(my_string), nchar2(my_string)) %>%
  collect()
#> Warning: Expression nchar2(my_string) not supported in Arrow; pulling data into
#> R
#> # A tibble: 1 × 3
#>   my_string `nchar(my_string)` `nchar2(my_string)`
#>   <chr>                  <int>               <int>
#> 1 1234                       4
{noformat}
I'm not sure if this works with the rlang data mask, but you could do this by setting `environment(fun)` to an environment that inherits the original `environment(fun)`. You probably don't want the data mask anyway because you don't want field references to interfere with the internal function variable names. (With apologies if you've done this already and I missed it):
{noformat}
masked_function <- function(fun, env) {
  # probably want to (shallow) copy `env` because we'd need to modify it
  # and it's passed by reference
  env2 <- new.env(parent = environment(fun))
  for (name in names(env)) {
    env2[[name]] <- env[[name]]
  }
  environment(fun) <- env2
  fun
}

some_var <- 45
my_function <- function() {
  some_var + 5
}

my_function()
#> [1] 50

masked_function(my_function, as.environment(list(some_var = 1)))()
#> [1] 6
{noformat}
> [R] Try to arrow_eval user-defined functions
> --------------------------------------------
>
> Key: ARROW-14071
> URL: https://issues.apache.org/jira/browse/ARROW-14071
> Project: Apache Arrow
> Issue Type: Improvement
> Components: R
> Reporter: Neal Richardson
> Assignee: Dewey Dunnington
> Priority: Major
> Fix For: 7.0.0
>
>
> The first test passes but the second one fails, even though they're equivalent. The user's function isn't being evaluated in the nse_funcs environment.
> {code}
> expect_dplyr_equal(
>   input %>%
>     select(-fct) %>%
>     filter(nchar(padded_strings) < 10) %>%
>     collect(),
>   tbl
> )
> isShortString <- function(x) nchar(x) < 10
> expect_dplyr_equal(
>   input %>%
>     select(-fct) %>%
>     filter(isShortString(padded_strings)) %>%
>     collect(),
>   tbl
> )
> {code}
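The scoping behaviour behind the failing test can be illustrated without Arrow. In this sketch, `nse_funcs` is only a stand-in environment, not the real object from the arrow package, and the `"arrow_nchar(...)"` string is an invented placeholder for whatever the translated expression would be. Evaluating `nchar()` directly in that environment picks up the override, while a user-defined wrapper keeps its own enclosing environment and so finds `base::nchar`.

```r
# Hypothetical stand-in for the table of Arrow-translated functions
nse_funcs <- new.env(parent = environment())
nse_funcs$nchar <- function(x) paste0("arrow_nchar(", x, ")")

# User-defined wrapper: its enclosing environment is where it was defined,
# not `nse_funcs`, so `nchar` inside it resolves to base::nchar
nchar2 <- function(x) nchar(x)

# Direct call: `nchar` is looked up in `nse_funcs` first, so the override wins
direct <- eval(quote(nchar("abcd")), nse_funcs)

# Call through the wrapper: `nchar2` itself is found via the parent
# environment, but its body is evaluated with lexical scoping
via_user_fun <- eval(quote(nchar2("abcd")), nse_funcs)
```

With these definitions, `direct` is the placeholder string `"arrow_nchar(abcd)"`, whereas `via_user_fun` is the integer `4` computed by base R, which mirrors the fallback to R seen in the issue.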