Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/08/25 14:33:31 UTC

[GitHub] [arrow] ianmcook commented on a change in pull request #10992: ARROW-13740: [R] summarize() should not eagerly evaluate

ianmcook commented on a change in pull request #10992:
URL: https://github.com/apache/arrow/pull/10992#discussion_r695814872



##########
File path: r/R/query-engine.R
##########
@@ -42,11 +55,73 @@ ExecPlan <- R6Class("ExecPlan",
       }
       # ScanNode needs the filter to do predicate pushdown and skip partitions,
       # and it needs to know which fields to materialize (and which are unnecessary)
-      ExecNode_Scan(self, dataset, filter, colnames)
+      ExecNode_Scan(self, dataset, filter, colnames %||% character(0))
+    },
+    Build = function(.data) {
+      # This method takes an arrow_dplyr_query and chains together the
+      # ExecNodes that they produce. It does not evaluate them--that is Run().
+      group_vars <- dplyr::group_vars(.data)
+      grouped <- length(group_vars) > 0
+
+      # Collect the target names first because we have to add back the group vars
+      target_names <- names(.data)
+      .data <- ensure_group_vars(.data)
+      .data <- ensure_arrange_vars(.data) # this sets .data$temp_columns
+
+      node <- self$Scan(.data)
+      # ARROW-13498: Even though Scan takes the filter, apparently we have to do it again
+      if (inherits(.data$filtered_rows, "Expression")) {
+        node <- node$Filter(.data$filtered_rows)
+      }
+      # If any columns are derived we need to Project (otherwise this may be no-op)
+      node <- node$Project(c(.data$selected_columns, .data$temp_columns))
+
+      if (length(.data$aggregations)) {
+        if (grouped) {
+          # We need to prefix all of the aggregation function names with "hash_"
+          .data$aggregations <- lapply(.data$aggregations, function(x) {
+            x[["fun"]] <- paste0("hash_", x[["fun"]])
+            x
+          })
+        }
+
+        node <- node$Aggregate(
+          options = .data$aggregations,
+          target_names = target_names,
+          out_field_names = names(.data$aggregations),
+          key_names = group_vars
+        )
+
+        if (grouped) {
+          # The result will have result columns first then the grouping cols.
+          # dplyr orders group cols first, so adapt the result to meet that expectation.

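For readers skimming the hunk above, here is a small standalone illustration of the "hash_" prefixing it describes: when the query is grouped, each aggregation's function name gets a "hash_" prefix so the grouped (hash) variant of the kernel is used. The shape of the aggregations list below is an assumption based on the hunk, not taken from the package internals.

    # Assumed shape of .data$aggregations, based on the hunk above (illustrative only)
    aggregations <- list(avg_mpg = list(fun = "mean", data = "mpg"))

    # The same rewrite the hunk applies when the query is grouped
    aggregations <- lapply(aggregations, function(x) {
      x[["fun"]] <- paste0("hash_", x[["fun"]])
      x
    })

    aggregations$avg_mpg$fun
    #> [1] "hash_mean"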
Review comment:
       Although dplyr sorts results by the grouping columns, I do not think it is an improvement to have arrow do that by default. Databases in general do not provide any guarantees that results will be sorted by grouping columns because sorting causes much worse performance on queries that group by high-cardinality columns. At a minimum I think we should provide an option that enables users to toggle this behavior on or off.
   
   
   I think we should provide an option to toggle on/off the behavior of sorting the result by the group columns. dplyr and some SQL engines do this, but other SQL engines do not, and sorting by high-cardinality grouping columns can make these queries much slower.
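   To make the suggestion concrete, here is a minimal sketch of what a user-facing toggle could look like. The option name "arrow.summarise.sort" and its default are hypothetical, invented here for illustration; they are not part of the arrow package.

    library(arrow)
    library(dplyr)

    # Hypothetical option (not a real arrow option): when FALSE, grouped results
    # come back in whatever order the hash aggregation produces, as databases do.
    options(arrow.summarise.sort = FALSE)

    Table$create(mtcars) %>%
      group_by(cyl) %>%
      summarize(avg_mpg = mean(mpg)) %>%
      collect()

    # Opting in would sort the collected result by the grouping columns,
    # matching dplyr's ordering.
    options(arrow.summarise.sort = TRUE)

   Reading such an option inside the package could be as simple as guarding the post-aggregation sort with getOption("arrow.summarise.sort", FALSE), though where that default should live is part of the design question.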


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org