Posted to jira@arrow.apache.org by "Neal Richardson (Jira)" <ji...@apache.org> on 2021/05/10 20:08:00 UTC
[jira] [Updated] (ARROW-12542) [R] SF columns in datasets with filters

     [ https://issues.apache.org/jira/browse/ARROW-12542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Neal Richardson updated ARROW-12542:
------------------------------------
        Fix Version/s: 5.0.0
    Affects Version/s: 4.0.0

> [R] SF columns in datasets with filters
> ---------------------------------------
>
>                 Key: ARROW-12542
>                 URL: https://issues.apache.org/jira/browse/ARROW-12542
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: R
>    Affects Versions: 4.0.0
>            Reporter: Jonathan Keane
>            Assignee: Jonathan Keane
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 5.0.0
>
>          Time Spent: 20m
>  Remaining Estimate: 0h
>
> First reported at https://issues.apache.org/jira/browse/ARROW-10386?focusedCommentId=17331668&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17331668
> I have reproduced a similar issue. In the following code, I create an sf object and write it to Parquet files as a partitioned dataset. I then call open_dataset() on those files.
> If I collect() the dataset, I get back an sf object with no problem.
> But if I filter() the dataset first and then collect(), I get an error.
> {code:r}
> library(sf)
> library(arrow)
> library(dplyr)
> n <- 10000
> fake <- tibble(
>     ID=seq(n),
>     Date=sample(seq(as.Date('2019-01-01'), as.Date('2021-04-01'), by=1), size=n, replace=TRUE),
>     x=runif(n=n, min=-170, max=170),
>     y=runif(n=n, min=-60, max=70),
>     text1=sample(x=state.name, size=n, replace=TRUE),
>     text2=sample(x=state.name, size=n, replace=TRUE),
>     text3=sample(x=state.division, size=n, replace=TRUE),
>     text4=sample(x=state.region, size=n, replace=TRUE),
>     text5=sample(x=state.abb, size=n, replace=TRUE),
>     num1=sample(x=state.center$x, size=n, replace=TRUE),
>     num2=sample(x=state.center$y, size=n, replace=TRUE),
>     num3=sample(x=state.area, size=n, replace=TRUE),
>     Rand1=rnorm(n=n),
>     Rand2=rnorm(n=n, mean=100, sd=3),
>     Rand3=rbinom(n=n, size=10, prob=0.4)
> )
> # make it into an sf object
> spat <- fake %>% 
>     st_as_sf(coords=c('x', 'y'), remove=FALSE, crs = 4326)
> class(spat)
> class(spat$geometry)
> # create new columns for partitioning and write to disk
> spat %>% 
>     mutate(Year=lubridate::year(Date), Month=lubridate::month(Date)) %>% 
>     group_by(Year, Month) %>% 
>     write_dataset('data/splits/', format='parquet')
> spat_in <- open_dataset('data/splits/')
> class(spat_in)
> # once collected, it's an sf object as expected
> spat_in %>% collect() %>% class()
> spat_in %>% collect() %>% pull(geometry) %>% class()
> # it even plots
> leaflet::leaflet() %>% 
>     leaflet::addTiles() %>% 
>     leafgl::addGlPoints(data=spat_in %>% collect())
> # but if we filter first
> spat_in %>% 
>     filter(Year == 2020 & Month == 2) %>% 
>     collect()
> # we get this error:
> # Error in st_geometry.sf(x) :
> #   attr(obj, "sf_column") does not point to a geometry column.
> # Did you rename it, without setting st_geometry(obj) <- "newname"?
> # In addition: Warning message:
> # Invalid metadata$r
> {code}


