Posted to issues@arrow.apache.org by "Neal Richardson (Jira)" <ji...@apache.org> on 2022/08/18 18:57:00 UTC

[jira] [Created] (ARROW-17463) [R] Avoid unnecessary projections

Neal Richardson created ARROW-17463:
---------------------------------------

             Summary: [R] Avoid unnecessary projections
                 Key: ARROW-17463
                 URL: https://issues.apache.org/jira/browse/ARROW-17463
             Project: Apache Arrow
          Issue Type: Improvement
          Components: R
            Reporter: Neal Richardson
            Assignee: Neal Richardson
             Fix For: 10.0.0


In ExecPlan$Build(), we call Project in a few places, and there is also code that ensures the query contains at least one ProjectNode, so that augmented fields from a Dataset scan are removed (unless the user has added them). As a result, it is possible to get multiple consecutive ProjectNodes that are essentially no-ops. One example is grouped aggregation: there is a projection that puts the columns back in the order R expects, followed by a no-op projection:

{code}
> mtcars |> arrow_table() |> count(cyl) |> explain()
ExecPlan with 6 nodes:
5:SinkNode{}
  4:ProjectNode{projection=[cyl, n]}
    3:ProjectNode{projection=[cyl, n]}
      2:GroupByNode{keys=["cyl"], aggregates=[
      	hash_sum(n, {skip_nulls=true, min_count=1}),
      ]}
        1:ProjectNode{projection=["n": 1, cyl]}
          0:TableSourceNode{}
{code}

I don't know how significant the performance impact is, but the extra projection certainly looks wasteful and should be avoidable.
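
As a rough illustration of the idea (not the actual arrow code), a no-op projection can be recognized as one that only selects fields and emits exactly the columns its input already emits, in the same order. The sketch below models the plan above with a hypothetical Node class in Python; the class, its fields, and the function names are all assumptions made for illustration, and the column order assumed for the GroupByNode output (aggregates before keys) is inferred from the reordering projection in the plan.

```python
from dataclasses import dataclass, replace
from typing import List, Optional


@dataclass(frozen=True)
class Node:
    """Hypothetical stand-in for a plan node; not an Arrow class."""
    kind: str                       # "source", "project", "group_by", "sink"
    output: List[str]               # column names emitted, in order
    computes: bool = False          # True if a projection evaluates expressions
    input: Optional["Node"] = None  # upstream node, None for the source


def is_noop_projection(node: Node) -> bool:
    """A projection is a no-op if it only selects fields and keeps
    the same columns in the same order as its input."""
    return (
        node.kind == "project"
        and node.input is not None
        and not node.computes
        and node.output == node.input.output
    )


def drop_noop_projections(node: Optional[Node]) -> Optional[Node]:
    """Return a copy of the plan with no-op projections removed."""
    if node is None:
        return None
    simplified_input = drop_noop_projections(node.input)
    node = replace(node, input=simplified_input)
    return node.input if is_noop_projection(node) else node


def kinds(node: Optional[Node]) -> List[str]:
    """List node kinds from the sink down to the source."""
    result = []
    while node is not None:
        result.append(node.kind)
        node = node.input
    return result


# The count(cyl) plan from the explain() output, built bottom-up.
# Source columns are abbreviated; only the shape of the plan matters.
source = Node("source", ["mpg", "cyl"])
add_n = Node("project", ["n", "cyl"], computes=True, input=source)
# Assumption: the aggregation emits aggregates first, then keys.
group_by = Node("group_by", ["n", "cyl"], input=add_n)
reorder = Node("project", ["cyl", "n"], input=group_by)
redundant = Node("project", ["cyl", "n"], input=reorder)
plan = Node("sink", ["cyl", "n"], input=redundant)

simplified = drop_noop_projections(plan)
print(kinds(plan))
print(kinds(simplified))
```

In this model the reordering projection survives because its column order differs from the aggregation output, while the projection above it is dropped. A real fix would more likely avoid adding the extra ProjectNode in the first place rather than pruning it afterwards.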


