You are viewing a plain text version of this content. The canonical link for it is here.
Posted to jira@arrow.apache.org by "Ian Cook (Jira)" <ji...@apache.org> on 2021/06/22 02:51:00 UTC
[jira] [Comment Edited] (ARROW-13117) [R] Retain schema in new Expressions

    [ https://issues.apache.org/jira/browse/ARROW-13117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17366952#comment-17366952 ] 

Ian Cook edited comment on ARROW-13117 at 6/22/21, 2:50 AM:
------------------------------------------------------------

The distinction between (1) and (2) is not significant. (1) is just (2) with scalar recycling. The same mechanisms underlie both.

The principle I had in mind with this PR is straightforward: An expression can optionally have a schema bound to it, and once a schema is bound to an expression, it will stay bound to _derivative_ expressions.

Before this PR, as soon as you create an expression inside a dplyr verb, you immediately lose the ability to know what its type is, despite the fact that it is fully knowable. You need to wait until {{arrow_mask()}} is run again to know it. In {{arrow_mask()}}, this command re-binds the schema to all the new expressions, so their types can be known again:
{code:r}
map(.data$selected_columns, ~(.$schema <- .data$.data$schema)){code}
After this PR, the types of expressions are known at all times. I think it is a clean solution; it's +7, -2 lines of actual package code; everything else is tests, docs, etc.

If the central objection here is that this causes the {{Expression}} object in the R package to deviate further from the {{Expression}} object in the C++ library—yes I agree; I wish there was a way to avoid that. Unfortunately {{compute::Expression}} in the C++ library lets you bind a schema to an expression, but the schema does not remain bound to derivative expressions. We already deviated somewhat by storing the schema in the {{Expression}} object in ARROW-12781. This PR takes better advantage of the fact that we're doing that.

I experimented with trying to achieve the changes in this PR through changes in {{expression.cpp}}, instead of to the R code, but that did not work because the problem is due to limitations in what is possible with {{compute::Expression}}, not limitations in how it is exposed to R.


was (Author: icook):
The distinction between (1) and (2) is not significant. (1) is just (2) with scalar recycling. The same mechanisms underlie both.

The principle I had in mind with this PR is straightforward: Expressions can optionally have schemas bound to them, and once a schema is bound to an expression, it will stay bound to derivative expressions.

Before this PR, as soon as you create an expression inside a dplyr verb, you immediately lose the ability to know what its type is, despite the fact that it is fully knowable. You need to wait until {{arrow_mask()}} is run again to know it. In {{arrow_mask()}}, this command re-binds the schema to all the new expressions, so their types can be known again:
{code:r}
map(.data$selected_columns, ~(.$schema <- .data$.data$schema)){code}
After this PR, the types of expressions are known at all times. I think it is a clean solution; it's +7, -2 lines of actual package code; everything else is tests, docs, etc.

If the central objection here is that this causes the {{Expression}} object in the R package to deviate further from the {{Expression}} object in the C++ library—yes I agree; I wish there was a way to avoid that. Unfortunately {{compute::Expression}} in the C++ library lets you bind a schema to an expression, but the schema does not remain bound to derivative expressions. We already deviated somewhat by storing the schema in the {{Expression}} object in ARROW-12781. This PR takes better advantage of the fact that we're doing that.

I experimented with trying to achieve the changes in this PR through changes in {{expression.cpp}}, instead of to the R code, but that did not work because the problem is due to limitations in what is possible with {{compute::Expression}}, not limitations in how it is exposed to R.

> [R] Retain schema in new Expressions
> ------------------------------------
>
>                 Key: ARROW-13117
>                 URL: https://issues.apache.org/jira/browse/ARROW-13117
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: R
>            Reporter: Ian Cook
>            Assignee: Ian Cook
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 5.0.0
>
>          Time Spent: 50m
>  Remaining Estimate: 0h
>
> When a new Expression is created, {{schema}} should be retained from the expression(s) it was created from. That way, the {{type()}} and {{type_id()}} methods of the new Expression will work. For example, currently this happens:
> {code:r}
> > x <- Expression$field_ref("x")
> > x$schema <- Schema$create(x = int32())
> > 
> > y <- Expression$field_ref("y")
> > y$schema <- Schema$create(y = int32())
> > 
> > Expression$create("add_checked", x, y)$type()
> Error: !is.null(schema) is not TRUE {code}
> This is what we want to happen:
> {code:r}
> > Expression$create("add_checked", x, y)$type()
> Int32
> int32
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)