Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/12/10 13:43:36 UTC

[GitHub] [arrow] dragosmg commented on a change in pull request #11915: ARROW-13834: [R][Documentation] Document the process of creating R bindings for compute kernels and rationale behind conventions

dragosmg commented on a change in pull request #11915:
URL: https://github.com/apache/arrow/pull/11915#discussion_r766687872



##########
File path: r/vignettes/developers/bindings.Rmd
##########
@@ -0,0 +1,189 @@
+When writing bindings between C++ compute functions and R functions, the aim is 
+to expose the C++ functionality via existing R functions.  The syntax and 
+functionality should, with some exceptions, exactly match that of the existing R 
+functions, so that users are able to use existing tidyverse 
+or base R syntax, or call existing S3 methods on objects, whilst taking 
+advantage of the speed and functionality of the underlying arrow package.
+
+# Implementing bindings for S3 generics
+
+# Implementing bindings to work within dplyr pipelines
+
+One of the main ways in which users interact with arrow is via dplyr syntax called 
+on Arrow objects.  For example, when a user calls `dplyr::mutate()` on an Arrow Tabular, 
+Dataset, or arrow data query object, the Arrow implementation of `mutate()` is 
+used and, under the hood, translates the dplyr code into Arrow C++ code.
+
+When using `dplyr::mutate()` or `dplyr::filter()`, you may want to use functions
+from other packages, e.g. 
+
+```{r}
+library(dplyr)
+library(stringr)
+starwars %>%
+  filter(str_detect(name, "Darth"))
+```
+This functionality has also been implemented in Arrow, e.g.:
+
+```{r}
+library(arrow)
+arrow_table(starwars) %>%
+  filter(str_detect(name, "Darth")) %>%
+  collect()
+```
+
+This is possible as a **binding** has been created between the stringr function
+`str_detect()` and the Arrow C++ function `match_substring_regex`.  You can see 
+this for yourself by inspecting the arrow data query object without retrieving the 
+results via `collect()`.
+
+```{r}
+arrow_table(starwars) %>%
+  filter(str_detect(name, "Darth")) 
+```
+
+In the following sections, we'll walk through how to create a binding between an 
+R function and an Arrow C++ function.
+
+## Walkthrough
+
+Imagine you are writing the bindings for the C++ function 
+[`starts_with()`](https://arrow.apache.org/docs/cpp/compute.html#containment-tests) 
+and want to bind it to the (base) R function `startsWith()`.
+
+Let's start by taking a look at the docs for both of those functions.
+
+First, here are the docs for R’s `startsWith()` (also available at https://stat.ethz.ch/R-manual/R-devel/library/base/html/startsWith.html):
+
+```{r, echo=FALSE, out.width="50%"}
+knitr::include_graphics("./startswithdocs.png")
+```
+
+It takes 2 parameters; `x` - the input, and `prefix` - the characters to check 
+if `x` starts with.
+
+Now, let’s go to 
+[the compute function documentation](https://arrow.apache.org/docs/cpp/compute.html#containment-tests)
+and look for the Arrow C++ library’s `starts_with()` function:
+
+```{r, echo=FALSE, out.width="50%"}
+knitr::include_graphics("./starts_with_docs.png")
+```
+We can see that `starts_with()` is a unary function, which means that it takes a
+single data input. The data input must be a string-like class, and the returned 
+value is boolean, both of which match up to R’s `startsWith()`.
+
+There is an options class associated with `starts_with()`, called [`MatchSubstringOptions`](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute21MatchSubstringOptionsE),
+so let’s take a look at that.
+
+```{r, echo=FALSE, out.width="50%"}
+knitr::include_graphics("./matchsubstringoptions.png")
+```
+
+Options classes allow the user to control the behaviour of the function.  In 
+this case, there are two possible options which can be supplied - `pattern` and 
+`ignore_case`, which are described in the docs shown above.
+
+What conclusions can be drawn from what we’ve seen so far?
+
+Base R’s `startsWith()` and Arrow’s `starts_with()` operate on equivalent data 
+types, return equivalent data types, and as there are no options implemented in 
+R that Arrow doesn’t have, this should be fairly simple to map without a great 
+deal of extra work.  
+
+As `starts_with()` has an options class associated with it, we’ll need to make 
+sure that it’s linked up with this in the R code.
+
+So let's get started.
+
+### Step 1 - add unit tests
+
+Look up the R function that you want to bind the compute kernel to, and write a 
+set of unit tests that use a dplyr pipeline and `compare_dplyr_binding()` (and 
+perhaps even `compare_dplyr_error()` if necessary).  These functions compare the 
+output of the original function with the dplyr bindings and make sure they match.
+
+Make sure you’re testing all parameters of the R function.
+
+Below is a possible example test for `startsWith()`.
+
+```{r, eval = FALSE}
+test_that("startsWith", {
+  df <- tibble(x = c("Foo", "bar", "baz", "qux"))
+ 
+  compare_dplyr_binding(
+    .input %>%
+        filter(startsWith(x, "b")) %>%
+        collect(),
+    df
+  )
+
+}

Review comment:
       I think you are missing a closing bracket `)` here: the one that closes `test_that(`. The last line of the chunk should be `})` rather than `}`.
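
The missing bracket can be confirmed with a syntax-only check. The following is a minimal sketch, not part of the PR: it copies the test from the diff into a string (the `snippet`, `broken` and `fixed` names are made up for illustration) and uses base R's `parse()`, which checks syntax without evaluating anything, so neither testthat nor the arrow test helpers need to be loaded.

```r
# The test as it appears in the diff, ending in `}` with no closing `)`.
snippet <- '
test_that("startsWith", {
  df <- tibble(x = c("Foo", "bar", "baz", "qux"))

  compare_dplyr_binding(
    .input %>%
        filter(startsWith(x, "b")) %>%
        collect(),
    df
  )

}'

# As written, the call to test_that( is never closed, so parsing fails.
broken <- tryCatch(parse(text = snippet), error = function(e) e)
stopifnot(inherits(broken, "error"))

# Appending the missing `)` makes the snippet a single complete expression.
fixed <- parse(text = paste0(snippet, ")"))
stopifnot(length(fixed) == 1L)
```

This only shows that the chunk becomes syntactically complete once `)` is appended. Whether the test itself passes still depends on the binding and on the helpers in the arrow test suite.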



