Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/04/14 18:43:41 UTC

[GitHub] [arrow] paleolimbot commented on pull request #12817: ARROW-15168: [R] Add S3 generics to create main Arrow objects

paleolimbot commented on PR #12817:
URL: https://github.com/apache/arrow/pull/12817#issuecomment-1099521412

   Ok! This is ready for a review. The main motivation here is to allow other packages to define conversions to Arrow objects. This is most useful for Table objects, since we currently convert to Table before `write_parquet/ipc_stream/feather/csv_arrow()`, and for Array objects, because the R vector -> Array conversion is at the heart of the default conversion to Table. A motivating project for these methods is 'geoarrow', which implements an ExtensionType to store geometry columns (e.g., `sf::st_sfc()`) and table-level metadata for table-level objects (e.g., `sf::st_sf()`).
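
   As a sketch of what this enables (the class `my_tbl`, the `my_pkg` metadata field, and the `schema` argument are hypothetical illustrations, not part of this PR), a package wrapping a data frame could opt in with something like:
   
   ``` r
   # Hypothetical: 'my_tbl' wraps a data frame plus package-level metadata.
   as_arrow_table.my_tbl <- function(x, ..., schema = NULL) {
     # delegate column conversion to the default data.frame conversion...
     tab <- arrow::as_arrow_table(as.data.frame(x), schema = schema)
     # ...then attach table-level metadata so it round-trips through writers
     tab$metadata$my_pkg <- x$metadata
     tab
   }
   ```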
   
   Other methods I added here are important for lower-level packages like the work-in-progress 'substrait' and 'narrow', where there are analogues for `DataType` (e.g., the `narrow_schema()` or `substrait.Type`), `Schema` (e.g., the `narrow_schema()` or `substrait.NamedStruct`), and `RecordBatchReader` (e.g., the `narrow_array_stream()`). In particular, I think that `as_record_batch_reader()` will be very useful (more flexible than `Table` for things like database results that can be streamed into the writers).
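
   A rough sketch of the streaming use case (the `db_result` object and its `as_record_batch_reader()` method are imagined here; `read_next_batch()` is the existing `RecordBatchReader` method):
   
   ``` r
   # Hypothetical: a database result that defines an as_record_batch_reader()
   # method can be consumed one RecordBatch at a time instead of collecting
   # the whole result into a Table first.
   reader <- arrow::as_record_batch_reader(db_result)
   while (!is.null(batch <- reader$read_next_batch())) {
     # process or write one batch at a time
   }
   ```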
   
   Adding the S3 methods was reasonably straightforward; however, actually using them was quite complicated because R vector -> Array conversion is highly optimized and mostly done in C++. The approach I took was to keep the existing code, changing as little as possible, for R objects that we handle internally. For other objects, the C++ code calls `type()` and `as_arrow_array()`. As is apparent from the number of files changed by this PR, that "simple" approach led to a whack-a-mole of test failures that I think I was able to solve with a minimal footprint; however, another option is to close this PR and tackle the changes in reverse order with smaller PRs.
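
   Conceptually, the dispatch now looks something like this (R pseudocode for the C++ logic; `is_handled_internally()` and `internal_to_array()` are imagined names, not actual arrow functions):
   
   ``` r
   # Pseudocode of the conversion path after this PR (not real arrow code)
   convert_to_array <- function(x, type = NULL) {
     if (is_handled_internally(x)) {
       # existing optimized C++ conversion, unchanged for base R objects
       internal_to_array(x, type)
     } else {
       # otherwise fall back to the user-extensible S3 generics
       if (is.null(type)) type <- arrow::type(x)
       arrow::as_arrow_array(x, type = type)
     }
   }
   ```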
   
   In a nutshell, before this PR, users were stuck with the built-in behaviour for objects with a custom class. This was usually OK, but occasionally sub-optimal, non-functional, or corrupting. I've picked a probably rare example because record-style vectors are new, but the current set of conversions assumes that they're list-like (as opposed to data-frame-like):
   
   ``` r
   # install.packages("arrow")
   library(arrow, warn.conflicts = FALSE)
   
   tbl <- tibble::tibble(
     x = 1:5,
     points = wk::xy(x = 1:5, y = 6:10)
   )
   
   # wk::xy() is a record-style vector, like POSIXlt, with a vctrs implementation
   str(unclass(tbl$points))
   #> List of 2
   #>  $ x: num [1:5] 1 2 3 4 5
   #>  $ y: num [1:5] 6 7 8 9 10
   
   # in the release version this fails
   Table$create(tbl)
   #> Error: Invalid: All columns must have the same length
   
   # ...or generates bogus output
   as.data.frame(Table$create(x = 1:2, points = tbl$points))
   #> Warning: Invalid metadata$r
   #> # A tibble: 2 × 2
   #>       x         points
   #>   <int> <list<double>>
   #> 1     1            [5]
   #> 2     2            [5]
   ```
   
   After this PR you can define `type()` and `as_arrow_array()` and the conversion should "just work". This is particularly useful in conjunction with the new extension type support, which can handle most (non-list-based) vctr classes (e.g., this PR removes the internal conversions for POSIXlt and haven types because the vctrs extension array handles them out of the box).
   
   ``` r
   # remotes::install_github("apache/arrow/r#12817")
   library(arrow, warn.conflicts = FALSE)
   
   tbl <- tibble::tibble(
     x = 1:5,
     points = wk::xy(x = 1:5, y = 6:10)
   )
   
   # wk::xy() is a record-style vector, like POSIXlt, with a vctrs implementation
   str(unclass(tbl$points))
   #> List of 2
   #>  $ x: num [1:5] 1 2 3 4 5
   #>  $ y: num [1:5] 6 7 8 9 10
   
   # this now fails:
   tf <- tempfile()
   write_feather(tbl, tf)
   #> Error:
   #> ! Can't infer Arrow data type from object inheriting from wk_xy / wk_rcrd
   
   # until...
   type.wk_xy <- function(x, ...) {
     vctrs_extension_type(vctrs::vec_ptype(x))
   }
   
   as_arrow_array.wk_xy <- function(x, ...) {
     vctrs_extension_array(x)
   }
   
   # now works!
   write_feather(tbl, tf)
   read_feather(tf)
   #> # A tibble: 5 × 2
   #>       x points 
   #>   <int> <wk_xy>
   #> 1     1 (1  6) 
   #> 2     2 (2  7) 
   #> 3     3 (3  8) 
   #> 4     4 (4  9) 
   #> 5     5 (5 10)
   
   # if for some reason the extension type is not loaded, we get the storage type
   # with no warning (maybe not ideal?)
   arrow::unregister_extension_type("arrow.r.vctrs")
   read_feather(tf)
   #> # A tibble: 5 × 2
   #>       x points$x    $y
   #>   <int>    <dbl> <dbl>
   #> 1     1        1     6
   #> 2     2        2     7
   #> 3     3        3     8
   #> 4     4        4     9
   #> 5     5        5    10
   ```
   
   <sup>Created on 2022-04-14 by the [reprex package](https://reprex.tidyverse.org) (v2.0.1)</sup>

