Posted to jira@arrow.apache.org by "Neal Richardson (Jira)" <ji...@apache.org> on 2022/10/13 14:16:00 UTC

[jira] [Updated] (ARROW-17462) [R] Cast scalars to type of field in Expression building

     [ https://issues.apache.org/jira/browse/ARROW-17462?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Neal Richardson updated ARROW-17462:
------------------------------------
    Fix Version/s: 11.0.0
                       (was: 10.0.0)

> [R] Cast scalars to type of field in Expression building
> --------------------------------------------------------
>
>                 Key: ARROW-17462
>                 URL: https://issues.apache.org/jira/browse/ARROW-17462
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: R
>            Reporter: Neal Richardson
>            Assignee: Neal Richardson
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 11.0.0
>
>          Time Spent: 4.5h
>  Remaining Estimate: 0h
>
> Looking at the ExecPlan output of some queries, I noticed that we translate {{ int_field == 5 }} in R as {{ cast(int_field, float64) == 5 }} because 5 is a double in R.
> This extra cast has a noticeable performance impact. Here's a simple query on the taxi dataset that filters down to 54 out of 1.5 billion rows and selects a single column. The idea was to make a query that does little other than evaluate the filter.
> {code}
> > system.time(ds |> select(passenger_count) |> filter(passenger_count > 10) |> compute())
>    user  system elapsed 
>   0.391   0.024   0.362 
> > system.time(ds |> select(passenger_count) |> filter(passenger_count > Scalar$create(10, type = int8())) |> compute())
>    user  system elapsed 
>   0.206   0.025   0.179 
> {code}
> You can see the difference in the query plans too:
> {code}
> > ds |> select(passenger_count) |> filter(passenger_count > 10) |> explain()
> ExecPlan with 4 nodes:
> 3:SinkNode{}
>   2:ProjectNode{projection=[passenger_count]}
>     1:FilterNode{filter=(cast(passenger_count, {to_type=double, allow_int_overflow=false, allow_time_truncate=false, allow_time_overflow=false, allow_decimal_truncate=false, allow_float_truncate=false, allow_invalid_utf8=false}) > 10)}
>       0:SourceNode{}
> > ds |> select(passenger_count) |> filter(passenger_count > Scalar$create(10, type = int8())) |> explain()
> ExecPlan with 4 nodes:
> 3:SinkNode{}
>   2:ProjectNode{projection=[passenger_count]}
>     1:FilterNode{filter=(passenger_count > 10)}
>       0:SourceNode{}
> {code}
> Ideally Acero would do this more intelligently (cf. ARROW-11402), but we should also be able to do smarter things when assembling the Expression in R.
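
As a rough illustration of the idea in the description, here is a minimal base R sketch. It shows why a bare literal such as 10 ends up as a double, and how a check might decide whether that literal can be represented exactly in an integer field, so that the scalar is cast rather than the column. The helper name fits_signed_int() and its bits argument are hypothetical; they are not part of the arrow package and are not necessarily how the eventual fix works.

```r
# Numeric literals in R are doubles unless written with the L suffix,
# which is why `int_field == 5` compares an integer field with a double.
typeof(10)   # "double"
typeof(10L)  # "integer"

# Hypothetical helper (NOT an arrow function): report whether a length-1
# numeric literal can be represented exactly in a signed integer type
# that is `bits` bits wide. If it can, the literal could be cast to the
# field's type instead of casting the whole field to double.
fits_signed_int <- function(value, bits) {
  stopifnot(is.numeric(value), length(value) == 1L)
  if (is.na(value)) {
    return(FALSE)  # NA and NaN have no exact integer representation
  }
  limit <- 2^(bits - 1)
  # Must be a whole number and lie within [-2^(bits-1), 2^(bits-1) - 1]
  value == trunc(value) && value >= -limit && value <= limit - 1
}

fits_signed_int(10, 8)    # whole number inside the int8 range
fits_signed_int(10.5, 8)  # fractional part would be lost
fits_signed_int(300, 8)   # outside the int8 range
```

When such a check passes, the literal could be built with the field's type, as in the Scalar$create(10, type = int8()) call quoted above. When it fails, keeping the existing cast of the field is the safe fallback. The sketch is only meant for narrow integer widths such as 8, 16, or 32 bits, where doubles represent the range limits exactly.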



--
This message was sent by Atlassian Jira
(v8.20.10#820010)