You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by "dgreiss (via GitHub)" <gi...@apache.org> on 2023/06/11 16:32:48 UTC

[GitHub] [arrow-cookbook] dgreiss opened a new pull request, #311: Document window aggregates

dgreiss opened a new pull request, #311:
URL: https://github.com/apache/arrow-cookbook/pull/311

   apache/arrow#35779 and apache/arrow#35709 document how to perform window aggregates
   
   It was discussed on the mailing list: https://lists.apache.org/thread/b16ghtb8q9hyl64ks3dp9ftm7pvlnsdk to document this operation in the vignette and in the cookbook.  Now that these are documented in the vignettes I thought I'd add them to the cookbook. Let me know what you think.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-cookbook] thisisnic merged pull request #311: [R] Document window aggregates

Posted by "thisisnic (via GitHub)" <gi...@apache.org>.
thisisnic merged PR #311:
URL: https://github.com/apache/arrow-cookbook/pull/311


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-cookbook] dgreiss commented on pull request #311: [R] Document window aggregates

Posted by "dgreiss (via GitHub)" <gi...@apache.org>.
dgreiss commented on PR #311:
URL: https://github.com/apache/arrow-cookbook/pull/311#issuecomment-1595788563

   I added some tests, but the tests don't match the solutions 1 for 1. I can change the tests, to test for the same number of rows and the same mean, but I thought this was a better way to test that the dplyr and arrow tables are the same.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-cookbook] thisisnic commented on pull request #311: [R] Document window aggregates

Posted by "thisisnic (via GitHub)" <gi...@apache.org>.
thisisnic commented on PR #311:
URL: https://github.com/apache/arrow-cookbook/pull/311#issuecomment-1597641196

   > I added some tests, but the tests don't match the solutions 1 for 1. I can change the tests, to test for the same number of rows and the same mean, but I thought this was a better way to test that the dplyr and arrow tables are the same.
   
   Thanks for adding those!  Updating to check for rows and means would be great - this repo is a bit different from the main Arrow repo - we have tests there that check the output values match between arrow and dplyr, so here we only want to check that the outputs in the code snippets have the values we expect.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-cookbook] dgreiss commented on pull request #311: [R] Document window aggregates

Posted by "dgreiss (via GitHub)" <gi...@apache.org>.
dgreiss commented on PR #311:
URL: https://github.com/apache/arrow-cookbook/pull/311#issuecomment-1586261901

   Awesome, just updated with your suggestions


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-cookbook] thisisnic commented on pull request #311: [R] Document window aggregates

Posted by "thisisnic (via GitHub)" <gi...@apache.org>.
thisisnic commented on PR #311:
URL: https://github.com/apache/arrow-cookbook/pull/311#issuecomment-1592847245

   There are a few more things that need doing before I can merge this:
   1.  Add test for code blocks
   2. Add DuckDB to the dependencies in the `install_dependencies` script.
   
   Do you want to do these, or do you mind me pushing to your branch and I'll do it?  Happy either way.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-cookbook] thisisnic commented on a diff in pull request #311: Document window aggregates

Posted by "thisisnic (via GitHub)" <gi...@apache.org>.
thisisnic commented on code in PR #311:
URL: https://github.com/apache/arrow-cookbook/pull/311#discussion_r1225864495


##########
r/content/tables.Rmd:
##########
@@ -317,4 +317,52 @@ the development version of the Arrow R package had been associated with their op
 classes.  However, as the Arrow C++ library's functionality extends, compute 
 functions may be added which do not yet have an R binding.  If you find a C++ 
 compute function which you wish to use from the R package, please [open an issue
-on the project JIRA](https://issues.apache.org/jira/projects/ARROW/issues).
\ No newline at end of file
+on the project JIRA](https://issues.apache.org/jira/projects/ARROW/issues).
+
+## Compute Window Aggregates
+
+Arrow does not support window functions, which can be problematic when applying an aggregate like `mean()` on a grouped table or within a rowwise operation like `filter()`:

Review Comment:
   Please could you rephrase this to describe the task that the user is trying to achieve (e.g. "you want to...")



##########
r/content/tables.Rmd:
##########
@@ -317,4 +317,52 @@ the development version of the Arrow R package had been associated with their op
 classes.  However, as the Arrow C++ library's functionality extends, compute 
 functions may be added which do not yet have an R binding.  If you find a C++ 
 compute function which you wish to use from the R package, please [open an issue
-on the project JIRA](https://issues.apache.org/jira/projects/ARROW/issues).
\ No newline at end of file
+on the project JIRA](https://issues.apache.org/jira/projects/ARROW/issues).
+
+## Compute Window Aggregates
+
+Arrow does not support window functions, which can be problematic when applying an aggregate like `mean()` on a grouped table or within a rowwise operation like `filter()`:
+
+```{r, arrow_window_aggregate}
+arrow_table(starwars) %>%
+  select(1:4) %>%
+  filter(!is.na(hair_color)) %>%
+  group_by(hair_color) %>%
+  filter(height < mean(height, na.rm = TRUE))
+```
+
+Arrow pulls the data into R, but for large tables, this sacrifices performance. You can perform these window aggregate operations on Arrow tables by:
+
+- Computing the aggregation separately, and join the result
+- Passing the data to DuckDB

Review Comment:
   In terms of following the formula used in other recipes, this would be better suited in the "discussion" section



##########
r/content/tables.Rmd:
##########
@@ -317,4 +317,52 @@ the development version of the Arrow R package had been associated with their op
 classes.  However, as the Arrow C++ library's functionality extends, compute 
 functions may be added which do not yet have an R binding.  If you find a C++ 
 compute function which you wish to use from the R package, please [open an issue
-on the project JIRA](https://issues.apache.org/jira/projects/ARROW/issues).
\ No newline at end of file
+on the project JIRA](https://issues.apache.org/jira/projects/ARROW/issues).

Review Comment:
   Mind also updating this to refer to GH issues instead of JIRA?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-cookbook] thisisnic commented on a diff in pull request #311: [R] Document window aggregates

Posted by "thisisnic (via GitHub)" <gi...@apache.org>.
thisisnic commented on code in PR #311:
URL: https://github.com/apache/arrow-cookbook/pull/311#discussion_r1234387604


##########
r/scripts/install_dependencies.R:
##########
@@ -82,7 +82,7 @@ install_arrow_version <- function(version_to_install) {
 }
 
 dependencies <- c("testthat", "bookdown", "knitr", "purrr", "remotes", 
-                  "dplyr", "stringr", "reticulate")
+                  "dplyr", "stringr", "reticulate", "duckdb")

Review Comment:
   ```suggestion
                     "dplyr", "stringr", "reticulate", "duckdb", "dbplyr")
   ```
   
   I think we need this, due to failing test



##########
r/scripts/install_dependencies.R:
##########
@@ -82,7 +82,7 @@ install_arrow_version <- function(version_to_install) {
 }
 
 dependencies <- c("testthat", "bookdown", "knitr", "purrr", "remotes", 
-                  "dplyr", "stringr", "reticulate")
+                  "dplyr", "stringr", "reticulate", "duckdb")

Review Comment:
   ```suggestion
                     "dplyr", "stringr", "reticulate", "duckdb", "dbplyr")
   ```
   
   I think we need this, due to failing tests



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org