Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/10/28 12:18:40 UTC

[GitHub] [arrow] eitsupi commented on a diff in pull request #14514: ARROW-17887: [R][Doc][WIP] Improve readability of the Get Started and README pages

eitsupi commented on code in PR #14514:
URL: https://github.com/apache/arrow/pull/14514#discussion_r1007998182


##########
r/vignettes/data_wrangling.Rmd:
##########
@@ -0,0 +1,172 @@
+---
+title: "Data analysis with dplyr syntax"
+description: >
+  Learn how to use the `dplyr` backend supplied by `arrow` 
+output: rmarkdown::html_vignette
+---
+
+The `arrow` package provides a `dplyr` back end that allows users to manipulate tabular Arrow data (`Table` and `Dataset` objects) using familiar `dplyr` syntax. To use this functionality, make sure that the `arrow` and `dplyr` packages are both loaded. In this article we will take the `starwars` data set included in `dplyr`, convert it to an Arrow Table, and then analyze this data. Note that, although these examples all use an in-memory `Table` object, the same functionality works for an on-disk `Dataset` object with only minor differences in behavior (documented later in the article).
+
+To get started let's load the packages and create the data:
+
+```{r}
+library(dplyr, warn.conflicts = FALSE)
+library(arrow, warn.conflicts = FALSE)
+
+sw <- arrow_table(starwars, as_data_frame = FALSE)
+```
+
+## One-table dplyr verbs
+
+The `arrow` package provides support for the `dplyr` one-table verbs, allowing users to construct data analysis pipelines in a familiar way. The example below shows the use of `filter()`, `rename()`, `mutate()`, `arrange()` and `select()`:
+
+```{r}
+result <- sw %>%
+  filter(homeworld == "Tatooine") %>%
+  rename(height_cm = height, mass_kg = mass) %>%
+  mutate(height_in = height_cm / 2.54, mass_lbs = mass_kg * 2.2046) %>%
+  arrange(desc(birth_year)) %>%
+  select(name, height_in, mass_lbs)
+```
+
+It is important to note that `arrow` uses lazy evaluation to delay computation until the result is explicitly requested. This speeds up processing by enabling the Arrow C++ library to perform multiple computations in one operation. As a consequence of this design choice, no computations have yet been performed on the `sw` data. The `result` variable is an object with class `arrow_dplyr_query` that represents all the computations to be performed:
+
+```{r}
+result
+```
+
+To perform these computations and materialize the result, we call
+`compute()` or `collect()`. The two differ in the kind of object they return. Calling `compute()` returns an Arrow Table, suitable for passing to other `arrow` or `dplyr` functions:
+
+```{r}
+compute(result)
+```
+
+In contrast, `collect()` returns an R data frame, suitable for viewing or passing to other R functions for analysis or visualization:
+
+```{r}
+collect(result)
+```
+
+The `arrow` package has broad support for single-table `dplyr` verbs, including those that compute aggregates. For example, it supports `group_by()` and `summarize()`, as well as commonly-used convenience functions such as `count()`:
+
+```{r}
+sw %>%
+  group_by(species) %>%
+  summarize(mean_height = mean(height, na.rm = TRUE)) %>%
+  collect()
+
+sw %>% 
+  count(gender) %>%
+  collect()
+```
+
+Note, however, that window functions such as `ntile()` are not yet supported. 
+
+## Two-table dplyr verbs
+
+Equality joins (e.g. `left_join()`, `inner_join()`) are supported for joining multiple tables. This is illustrated below:
+
+```{r}
+jedi <- data.frame(
+  name = c("C-3PO", "Luke Skywalker", "Obi-Wan Kenobi"),
+  jedi = c(FALSE, TRUE, TRUE)
+)
+
+sw %>%
+  select(1:3) %>%
+  right_join(jedi) %>%
+  collect()
+```
+
+## Expressions within dplyr verbs
+
+Inside `dplyr` verbs, Arrow offers support for many functions and operators, with common functions mapped to their base R and tidyverse equivalents. The [changelog](https://arrow.apache.org/docs/r/news/index.html) lists many of them. If there are additional functions you would like to see implemented, please file an issue as described in the [Getting help](https://arrow.apache.org/docs/r/#getting-help) guidelines.
+
+## Registering custom bindings
+
+The `arrow` package makes it possible for users to supply bindings for custom functions in some situations using `register_scalar_function()`. To operate correctly, the to-be-registered function must have `context` as its first argument, as required by the query engine. For example, suppose we wanted to implement a function that converts a string to snake case (a greatly simplified version of `janitor::make_clean_names()`). The function could be written as follows:
+
+```{r}
+to_snake_name <- function(context, string) {
+  replace <- c(`'` = "", `"` = "", `-` = "", `\\.` = "_", ` ` = "_")
+  string %>% 
+    stringr::str_replace_all(replace) %>%
+    stringr::str_to_lower() %>% 
+    stringi::stri_trans_general(id = "Latin-ASCII")
+}
+```
+
+To call this within an `arrow`/`dplyr` pipeline, it needs to be registered:
+
+```{r}
+register_scalar_function(
+  name = "to_snake_name",
+  fun = to_snake_name,
+  in_type = utf8(),
+  out_type = utf8(),
+  auto_convert = TRUE
+)
+```
+
+In this call, the `name` argument specifies the name by which the function will be recognized in the `arrow`/`dplyr` pipeline, and `fun` is the function itself. The `in_type` and `out_type` arguments specify the expected data types of the input and output, and `auto_convert` specifies whether `arrow` should automatically convert any R inputs to their Arrow equivalents.
+
+Once registered, the following works:
+
+```{r}
+sw %>% 
+  transmute(name, snake_name = to_snake_name(name)) %>%

Review Comment:
   How about using `mutate(.keep = "none")` instead of `transmute()`?
   tidyverse/dplyr#6414
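
For illustration, here is a minimal sketch of the suggested swap on a plain data frame. The toy `df` and the `height_in` column are hypothetical stand-ins, not taken from the PR, and the sketch assumes (without verifying here) that the `arrow` backend treats `.keep = "none"` the same way plain `dplyr` does:

```r
library(dplyr, warn.conflicts = FALSE)

# Hypothetical toy data standing in for the `sw` table in the vignette.
df <- data.frame(
  name = c("Luke Skywalker", "C-3PO"),
  height = c(172, 167)
)

# Current spelling in the diff: keep only the columns named in the call.
via_transmute <- df %>%
  transmute(name, height_in = height / 2.54)

# Suggested spelling: the same columns, expressed through mutate().
# `name` is listed explicitly so that it survives `.keep = "none"`.
via_mutate <- df %>%
  mutate(name, height_in = height / 2.54, .keep = "none")
```

One thing that may be worth checking before switching: the two spellings can order columns differently when an existing column is listed after a new one, so the existing column is best listed first, as above.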


