You are viewing a plain text version of this content. The canonical link for it is here.

Posted to github@arrow.apache.org by "ianmcook (via GitHub)" <gi...@apache.org> on 2023/02/08 21:03:37 UTC

[GitHub] [arrow] ianmcook commented on a diff in pull request #33917: GH-33892: [R] Map `dplyr::n()` to `count_all` kernel

ianmcook commented on code in PR #33917:
URL: https://github.com/apache/arrow/pull/33917#discussion_r1100677454


##########
r/R/query-engine.R:
##########
@@ -101,7 +101,11 @@ ExecPlan <- R6Class("ExecPlan",
         # plus group_by_vars (last)
         # TODO: validate that none of names(aggregations) are the same as names(group_by_vars)
         # dplyr does not error on this but the result it gives isn't great
-        node <- node$Project(summarize_projection(.data))
+        projection <- summarize_projection(.data)
+        # skip projection if no grouping and all aggregate functions are nullary

Review Comment:
   My understanding is that adding a project node before an aggregate node will never improve performance, and that the only real reason to do it is to ensure that the aggregate node has all the columns it needs.
   
   I just checked with Weston about this. Here's our conversation:
   
   **Ian:**
   Which of these two ExecPlans would execute the most quickly and efficiently?
   
   1. Read source data containing columns A,B,C,D --> project to only columns A and B --> aggregate sum of A over groups in B --> output to table
   
   2. Read source data containing columns A,B,C,D --> aggregate sum of A over groups in B --> output to table
   
   In other words, does adding an unnecessary projection help, hurt, or not make any difference?
   
   **Weston**:
   I assume by "project to" you mean an actual project node and not a push down projection? In reality you'd want to avoid reading C and D from disk so you'd want the projection in the read.
   
   **Ian**:
   correct, actual project node
   
   **Weston**:
   An actual project node, with just field references, is effectively free.
   
   Now, if the follow-up node was a filter node, then you might be paying more for plan 2.  We don't use selection vectors so the filter node would have to materialize the data and it would have to materialize your two junk columns.
   
   The aggregate node, on the other hand, doesn't output any input columns.  So I'd expect it will throw them away in it's InputReceived and the performance of 1 & 2 will be more or less the same.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org