You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@arrow.apache.org by "Ben Kietzman (Jira)" <ji...@apache.org> on 2022/08/17 14:44:00 UTC

[jira] [Created] (ARROW-17446) [R] Allow unrecognized R expressions to be callable as compute::Functions

Ben Kietzman created ARROW-17446:
------------------------------------

             Summary: [R] Allow unrecognized R expressions to be callable as compute::Functions
                 Key: ARROW-17446
                 URL: https://issues.apache.org/jira/browse/ARROW-17446
             Project: Apache Arrow
          Issue Type: Bug
          Components: R
            Reporter: Ben Kietzman


Currently, if an R expression is not entirely supported by the arrow compute engine, the entire input will be pulled into memory for native R to operate on. It would be possible to instead provide add a custom compute function to the registry (inside {{R_init_arrow}}, probably) which evaluates any sub expressions which couldn't be translated to native arrow compute expressions.

This would for example allow a filter expression including a call to an R function {{baz}} to evaluate on a dataset larger than memory and with predicate and projection pushdown as normal using the expressions which *are* translatable. The resulting expression might look something like this in c++:

{code}
call("and_kleene", {
  call("greater", {field_ref("a"), scalar(1)}),
  call("r_expr", {field_ref("b")},
      /*options=*/RexprOptions{cpp11::function(baz_sexp)}),
});
{code}

In this case although the "r_expr" function is opaque to compute and datasets, we would still recognize that only fields "a" and "b" need to be materialized. Furthermore, the first member of the filter's conjunction is {{a > 1}}, which *is* translatable and could be used for predicate pushdown, for checking against parquet statistics, etc.

Since R is not multithreaded, the compute function would need to take a global lock to ensure only a single thread of R execution. This might be untenable since it would also lock the interpreter. Still, it seems like a worthwhile option to consider



--
This message was sent by Atlassian Jira
(v8.20.10#820010)