Posted to issues@arrow.apache.org by "Dewey Dunnington (Jira)" <ji...@apache.org> on 2022/05/26 16:47:00 UTC

[jira] [Created] (ARROW-16670) [R] Behaviour of R-specific key/value metadata in the query engine

Dewey Dunnington created ARROW-16670:
----------------------------------------

             Summary: [R] Behaviour of R-specific key/value metadata in the query engine
                 Key: ARROW-16670
                 URL: https://issues.apache.org/jira/browse/ARROW-16670
             Project: Apache Arrow
          Issue Type: Improvement
          Components: R
            Reporter: Dewey Dunnington


In ARROW-16607 there are some changes to metadata handling in the {{arrow_dplyr_query}}. With extension type support, more column types (like {{sf::sfc}}) can be supported, and as support for column types grows, so does the chance that our current policy of restoring metadata by default will cause difficult-to-work-around errors. The latest one I have run across is this one:

{code:R}
library(arrow, warn.conflicts = FALSE)
library(dplyr, warn.conflicts = FALSE)
# required for write_dataset(nc) to work
# remotes::install_github("paleolimbot/geoarrow")
library(geoarrow)
library(sf)
#> Linking to GEOS 3.9.1, GDAL 3.4.2, PROJ 8.2.1; sf_use_s2() is TRUE

nc <- read_sf(system.file("shape/nc.shp", package = "sf"))
tf <- tempfile()
write_dataset(nc, tf)

open_dataset(tf) %>% 
  select(NAME, FIPS) %>% 
  collect()
#> Error in st_geometry.sf(x): attr(obj, "sf_column") does not point to a geometry column.
#> Did you rename it, without setting st_geometry(obj) <- "newname"?
{code}

This causes an error because the restored class makes assumptions about the contents of the data frame that we can't necessarily know about (or would have to hard-code for every data frame subclass).
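
As a package-free sketch of this failure mode (the {{keyed_df}} class, its {{key_col}} attribute, and {{key_values()}} are hypothetical stand-ins, not part of arrow or sf), blindly restoring saved attributes onto a result that no longer has the expected column yields an object that violates its own class invariant:

{code:R}
# Hypothetical data frame subclass: its "key_col" attribute must name an
# existing column (analogous to sf's "sf_column", but this is not sf code).
new_keyed_df <- function(df, key_col) {
  stopifnot(key_col %in% names(df))
  structure(df, key_col = key_col, class = c("keyed_df", "data.frame"))
}

# Accessor that relies on the invariant above
key_values <- function(x) {
  key_col <- attr(x, "key_col")
  if (!key_col %in% names(x)) {
    stop("attr(x, \"key_col\") does not point to an existing column")
  }
  x[[key_col]]
}

original <- new_keyed_df(data.frame(id = 1:3, value = c("a", "b", "c")), "id")

# Save the data-frame-level attributes, as a write step might
saved <- attributes(original)[c("class", "key_col")]

# Stand-in for a query result in which the key column was dropped
result <- data.frame(value = c("a", "b", "c"))

# Restore the saved attributes without checking the new contents
restored <- result
attr(restored, "key_col") <- saved$key_col
class(restored) <- saved$class

# The object claims to be a keyed_df, but its accessor now fails
inherits(restored, "keyed_df")
key_error <- tryCatch(
  key_values(restored),
  error = function(e) conditionMessage(e)
)
key_error
{code}

The restoring code has no way of knowing that {{key_col}} must name an existing column unless it is taught about {{keyed_df}} specifically, which is the hard-coding problem described above.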

I can see why {{arrow::write_parquet()}} and {{arrow::read_parquet()}} (and feather, ipc_stream) might want to do this to faithfully roundtrip a data frame, and because the write/read roundtrip (usually) involves the same columns and the same rows, it's probably safe to restore metadata by default.

The query engine does a lot of transformations that can break assumptions like the one I've shown above (where sf expects a certain column to exist and otherwise errors in a way that the user can't work around). Rather than hard-code the assumptions of every data.frame and vector subclass, I wonder if ignoring the R metadata for query engine output would be a better strategy. If it's not the default, it would be nice to provide an escape hatch for users or developers who find themselves in this position with no workaround.

With the addition of the vctrs extension type, there is a route to preserve attributes through the query engine (although it's a bit verbose). We could make it easier to do (e.g., by interpreting {{I()}} or {{rlang::box()}} in some way).

{code:R}
library(arrow, warn.conflicts = FALSE)
library(dplyr, warn.conflicts = FALSE)

df <- data.frame(int_col = 1:5)
attr(df$int_col, "some_attr") <- "some_value"

tf <- tempfile()

# attributes dropped when column is renamed
write_dataset(df, tf)

open_dataset(tf) %>% 
  select(other_int_col = int_col) %>% 
  collect() %>% 
  pull()
#> [1] 1 2 3 4 5

# attributes preserved when column is renamed
table <- arrow_table(int_col = vctrs_extension_array(df$int_col))
write_dataset(table, tf)

open_dataset(tf) %>% 
  select(other_int_col = int_col) %>% 
  collect() %>% 
  pull()
#> [1] 1 2 3 4 5
#> attr(,"some_attr")
#> [1] "some_value"
{code}



