Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/07/07 17:45:50 UTC

[GitHub] [arrow] paleolimbot commented on pull request #13397: ARROW-16444: [R] Implement user-defined scalar functions in R bindings

paleolimbot commented on PR #13397:
URL: https://github.com/apache/arrow/pull/13397#issuecomment-1177991132

   I *think* I've incorporated all the comments here - I've summarised the unresolved bits below, but feel free to add to that list.
   
   I agree that the "the whole entire plan must be completely evaluated in one call into C++ from R" constraint is not ideal and I'm not offended if we want to bump this to the next release to see if we can do it better. It's a new feature and I think it's OK that we include it and let users give feedback on ways that user-defined functions can be improved (which may include support for the R-level record batch reader).
   
   I included improvements to `SafeCallIntoR<>()` / `RunWithCapturedR()` in this PR because it felt like the bad error messages and code complexity of using them were becoming particularly evident. I'm happy to remove those changes and put them in another PR, too, since they widen the scope of this PR beyond just UDFs.
   
   Here's a motivating example from the geospatial end of things that might be more fun to play with. It does highlight some of the complexities of matching extension types, which is not all that well supported yet.
   
   <details>
   
   ``` r
   # remotes::install_github("apache/arrow#13397")
   library(arrow, warn.conflicts = FALSE)
   #> Some features are not enabled in this build of Arrow. Run `arrow_info()` for more information.
   library(dplyr, warn.conflicts = FALSE)
   # remotes::install_github("paleolimbot/geoarrow")
   library(geoarrow)
   library(sf)
   #> Linking to GEOS 3.9.1, GDAL 3.4.2, PROJ 8.2.1; sf_use_s2() is TRUE
   
   # (need a better generator for this in geoarrow)
   geoarrow_wkb_type_arrow <- arrow:::DataType$import_from_c(
     narrow::as_narrow_schema(geoarrow_wkb())
   )
   
   # scalar function wrapper
   st_perimeter_wrapper <- arrow_scalar_function(
     function(x) {
       sf::st_length(sf::st_boundary(sf::st_as_sfc(x)))
     },
     in_type = schema(x = geoarrow_wkb_type_arrow),
     out_type = float64()
   )
   
   # register!
   register_user_defined_function(st_perimeter_wrapper, "st_perimeter")
   
   # some example data
   nc <- sf::read_sf(system.file("shape/nc.shp", package = "sf"))
   # parameterized extension types (e.g., with crs) don't match the kernel signature
   sf::st_crs(nc) <- NA_crs_
   nc_table <- as_geoarrow_table(nc, schema = geoarrow_schema_wkb())
   
   # use in a pipeline
   nc_table |> 
     transmute(NAME, len = st_perimeter(geometry)) |> 
     collect()
   #> # A tibble: 100 × 2
   #>    NAME          len
   #>    <chr>       <dbl>
   #>  1 Ashe         1.44
   #>  2 Alleghany    1.23
   #>  3 Surry        1.63
   #>  4 Currituck    2.97
   #>  5 Northampton  2.21
   #>  6 Hertford     1.67
   #>  7 Camden       1.55
   #>  8 Gates        1.28
   #>  9 Warren       1.42
   #> 10 Stokes       1.43
   #> # … with 90 more rows
   
   # check answers
   nc |> 
     transmute(NAME, len = sf::st_length(sf::st_boundary(geometry)))
   #> Simple feature collection with 100 features and 2 fields
   #> Geometry type: MULTIPOLYGON
   #> Dimension:     XY
   #> Bounding box:  xmin: -84.32385 ymin: 33.88199 xmax: -75.45698 ymax: 36.58965
   #> CRS:           NA
   #> # A tibble: 100 × 3
   #>    NAME          len                                                    geometry
   #>  * <chr>       <dbl>                                              <MULTIPOLYGON>
   #>  1 Ashe         1.44 (((-81.47276 36.23436, -81.54084 36.27251, -81.56198 36.27…
   #>  2 Alleghany    1.23 (((-81.23989 36.36536, -81.24069 36.37942, -81.26284 36.40…
   #>  3 Surry        1.63 (((-80.45634 36.24256, -80.47639 36.25473, -80.53688 36.25…
   #>  4 Currituck    2.97 (((-76.00897 36.3196, -76.01735 36.33773, -76.03288 36.335…
   #>  5 Northampton  2.21 (((-77.21767 36.24098, -77.23461 36.2146, -77.29861 36.211…
   #>  6 Hertford     1.67 (((-76.74506 36.23392, -76.98069 36.23024, -76.99475 36.23…
   #>  7 Camden       1.55 (((-76.00897 36.3196, -75.95718 36.19377, -75.98134 36.169…
   #>  8 Gates        1.28 (((-76.56251 36.34057, -76.60424 36.31498, -76.64822 36.31…
   #>  9 Warren       1.42 (((-78.30876 36.26004, -78.28293 36.29188, -78.32125 36.54…
   #> 10 Stokes       1.43 (((-80.02567 36.25023, -80.45301 36.25709, -80.43531 36.55…
   #> # … with 90 more rows
   ```
   
   <sup>Created on 2022-07-07 by the [reprex package](https://reprex.tidyverse.org) (v2.0.1)</sup>
   
   </details>

