Posted to issues@arrow.apache.org by "Neal Richardson (Jira)" <ji...@apache.org> on 2022/08/18 18:57:00 UTC
[jira] [Created] (ARROW-17463) [R] Avoid unnecessary projections
Neal Richardson created ARROW-17463:
---------------------------------------
Summary: [R] Avoid unnecessary projections
Key: ARROW-17463
URL: https://issues.apache.org/jira/browse/ARROW-17463
Project: Apache Arrow
Issue Type: Improvement
Components: R
Reporter: Neal Richardson
Assignee: Neal Richardson
Fix For: 10.0.0
In ExecPlan$Build(), we call Project in a few places, and there is also code that ensures the query contains at least one ProjectNode, so that augmented fields from a Dataset scan are removed (unless the user has added them). As a result, a plan can end up with several consecutive ProjectNodes, some of which are essentially no-ops. Grouped aggregation is one example: one projection restores the column order that R expects, and it is followed by a second projection that changes nothing:
{code}
> mtcars |> arrow_table() |> count(cyl) |> explain()
ExecPlan with 6 nodes:
5:SinkNode{}
4:ProjectNode{projection=[cyl, n]}
3:ProjectNode{projection=[cyl, n]}
2:GroupByNode{keys=["cyl"], aggregates=[
hash_sum(n, {skip_nulls=true, min_count=1}),
]}
1:ProjectNode{projection=["n": 1, cyl]}
0:TableSourceNode{}
{code}
I don't know how significant the performance impact is, but the extra ProjectNode certainly looks wasteful and should be avoidable.
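One possible way to think about the fix, sketched in base R only: treat a projection as redundant when it merely selects existing fields and its output names and order match those of the node feeding it. Everything below is a hypothetical model for illustration (the helpers project_node(), other_node(), is_noop_projection() and drop_noop_projections() are made up and are not part of the arrow package, and the sink node is omitted). It also assumes the GroupByNode emits the aggregate column before the key, which is why the plan above needs a reordering projection at all.

```r
# Hypothetical model of a plan: a list of nodes ordered from source to sink.
# Each node records its type and output column names; projections also
# record whether every expression is a bare field reference.
project_node <- function(names, field_refs_only = TRUE) {
  list(type = "project", names = names, field_refs_only = field_refs_only)
}
other_node <- function(type, names) list(type = type, names = names)

# A projection is a no-op if it only selects existing fields and its
# output names (including their order) match those of its input node.
is_noop_projection <- function(node, input) {
  node$type == "project" &&
    isTRUE(node$field_refs_only) &&
    identical(node$names, input$names)
}

# Walk the plan once, skipping any projection that is a no-op relative
# to the last node that was kept.
drop_noop_projections <- function(plan) {
  kept <- list()
  for (node in plan) {
    if (length(kept) > 0 && is_noop_projection(node, kept[[length(kept)]])) {
      next
    }
    kept[[length(kept) + 1]] <- node
  }
  kept
}

# The count(cyl) plan from the issue, minus the sink.
plan <- list(
  other_node("table_source", names(datasets::mtcars)),
  project_node(c("n", "cyl"), field_refs_only = FALSE), # "n": 1 is computed
  other_node("group_by", c("n", "cyl")),  # assumed output order
  project_node(c("cyl", "n")),            # restores the order R expects
  project_node(c("cyl", "n"))             # redundant
)

optimized <- drop_noop_projections(plan)
print(vapply(optimized, function(x) x$type, character(1)))
```

In this model only the final projection is dropped; the reordering projection stays because its column order differs from its input. A real implementation in ExecPlan$Build() would presumably have to compare expressions and schemas rather than plain name vectors.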