You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@arrow.apache.org by th...@apache.org on 2023/06/21 15:53:28 UTC
[arrow-cookbook] branch main updated: [R] Document window aggregates (#311)

This is an automated email from the ASF dual-hosted git repository.

thisisnic pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/arrow-cookbook.git


The following commit(s) were added to refs/heads/main by this push:
     new 2b473ce  [R] Document window aggregates (#311)
2b473ce is described below

commit 2b473ce0391af2cdddcc4b00445fa3c51f47c27b
Author: David Greiss <dg...@users.noreply.github.com>
AuthorDate: Wed Jun 21 11:53:24 2023 -0400

    [R] Document window aggregates (#311)
---
 r/content/tables.Rmd             | 100 ++++++++++++++++++++++++++++++++++++++-
 r/scripts/install_dependencies.R |   2 +-
 2 files changed, 100 insertions(+), 2 deletions(-)

diff --git a/r/content/tables.Rmd b/r/content/tables.Rmd
index 75078c2..89225ab 100644
--- a/r/content/tables.Rmd
+++ b/r/content/tables.Rmd
@@ -317,4 +317,102 @@ the development version of the Arrow R package had been associated with their op
 classes.  However, as the Arrow C++ library's functionality extends, compute 
 functions may be added which do not yet have an R binding.  If you find a C++ 
 compute function which you wish to use from the R package, please [open an issue
-on the project JIRA](https://issues.apache.org/jira/projects/ARROW/issues).
\ No newline at end of file
+on the Github project](https://github.com/apache/arrow/issues).
+
+## Compute Window Aggregates
+
+You want to apply an aggregation (e.g. `mean()`) on a grouped table or within a rowwise operation like `filter()`:
+
+### Solution
+
+```{r, arrow_window_aggregate_joins}
+arrow_table(starwars) %>%
+  select(1:4) %>%
+  filter(!is.na(hair_color)) %>%
+  left_join(
+    arrow_table(starwars) %>%
+      group_by(hair_color) %>%
+      summarize(mean_height = mean(height, na.rm = TRUE))
+  ) %>%
+  filter(height < mean_height) %>%
+  select(!mean_height) %>%
+  collect()
+```
+
+```{r, test_arrow_window_aggregate_joins, opts.label = "test"}
+test_that("arrow_window_aggregate_joins", {
+  out <- arrow_table(starwars) %>%
+    select(1:4) %>%
+    filter(!is.na(hair_color)) %>%
+    left_join(
+      arrow_table(starwars) %>%
+        group_by(hair_color) %>%
+        summarize(mean_height = mean(height, na.rm = TRUE))
+    ) %>%
+    filter(height < mean_height) %>%
+    select(!mean_height) %>%
+    collect()
+
+  expected <- starwars %>%
+    select(name, height, hair_color) %>%
+    filter(!is.na(hair_color)) %>%
+    group_by(hair_color) %>%
+    filter(height < mean(height, na.rm = TRUE))
+
+  expect_equal(nrow(out), nrow(expected))
+  expect_equal(sum(out$height), sum(expected$height))
+})
+```
+
+Or using `to_duckdb()`
+```{r, arrow_window_aggregate_duckdb}
+arrow_table(starwars) %>%
+  select(1:4) %>%
+  filter(!is.na(hair_color)) %>%
+  to_duckdb() %>%
+  group_by(hair_color) %>%
+  filter(height < mean(height, na.rm = TRUE)) %>%
+  to_arrow() %>%
+  collect()
+```
+
+```{r, test_arrow_window_aggregate_duckdb, opts.label = "test"}
+test_that("arrow_window_aggregate_duckdb", {
+  out <- arrow_table(starwars) %>%
+    select(1:4) %>%
+    filter(!is.na(hair_color)) %>%
+    to_duckdb() %>%
+    group_by(hair_color) %>%
+    filter(height < mean(height, na.rm = TRUE)) %>%
+    to_arrow() %>%
+    collect()
+
+  expected <- starwars %>%
+    select(name, height, hair_color) %>%
+    filter(!is.na(hair_color)) %>%
+    group_by(hair_color) %>%
+    filter(height < mean(height, na.rm = TRUE))
+
+  expect_equal(nrow(out), nrow(expected))
+  expect_equal(sum(out$height), sum(expected$height))
+})
+```
+
+### Discusson
+
+Arrow does not support window functions, and pulls the data into R. For large tables, this sacrifices performance. 
+
+```{r, arrow_window_aggregate}
+arrow_table(starwars) %>%
+  select(1:4) %>%
+  filter(!is.na(hair_color)) %>%
+  group_by(hair_color) %>%
+  filter(height < mean(height, na.rm = TRUE))
+```
+
+You can perform these window aggregate operations on Arrow tables by:
+
+- Computing the aggregation separately, and joining the result
+- Passing the data to DuckDB, and use the DuckDB query engine to perform the operations
+
+Arrow supports zero-copy integration with DuckDB, and DuckDB can query Arrow datasets directly and stream query results back to Arrow. This integreation uses zero-copy streaming of data between DuckDB and Arrow and vice versa so that you can compose a query using both together, all the while not paying any cost to (re)serialize the data when you pass it back and forth. This is especially useful in cases where something is supported in one of Arrow or DuckDB query engines but not the othe [...]
\ No newline at end of file
diff --git a/r/scripts/install_dependencies.R b/r/scripts/install_dependencies.R
index 60eec6d..5daec6a 100644
--- a/r/scripts/install_dependencies.R
+++ b/r/scripts/install_dependencies.R
@@ -82,7 +82,7 @@ install_arrow_version <- function(version_to_install) {
 }
 
 dependencies <- c("testthat", "bookdown", "knitr", "purrr", "remotes", 
-                  "dplyr", "stringr", "reticulate")
+                  "dplyr", "stringr", "reticulate", "duckdb", "dbplyr")
 
 for (dependency in dependencies) {
   load_package(dependency)