Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/03/29 14:32:33 UTC
[GitHub] [arrow] paleolimbot edited a comment on pull request #12467: ARROW-15471: [R] ExtensionType support in R
paleolimbot edited a comment on pull request #12467:
URL: https://github.com/apache/arrow/pull/12467#issuecomment-1081896525
A few more modifications:
- I moved the conversion of extension types to R objects into the Converter API and removed the modifications to ChunkedArray, Table, and RecordBatch (as Romain suggested). This feels a lot better and is more likely to "just work" in more places.
- I implemented the geoarrow side of this to make sure it will work. It does, except for one spot in the compute kernels where Concatenate doesn't work for extension types! See the details below and https://github.com/paleolimbot/geoarrow/pull/7
- I did play with reversing the order in which the C++ and R6 objects are instantiated, but I think that would be a fairly big change to how objects get passed around in our current implementation. I did, however, rename the methods to match the C++ method names, and it feels a lot better now.
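
For readers unfamiliar with the mechanism, here is a minimal sketch of how an extension type can control conversion to an R vector. It assumes the function and method names found in released versions of the arrow R package (`new_extension_type()`, `new_extension_array()`, and an `as_vector()` method on `ExtensionType`); the names in this PR at the time of this comment may have differed. `CentsType` and `"example.cents"` are hypothetical and used only for illustration.

``` r
library(arrow, warn.conflicts = FALSE)

# Hypothetical extension type: values are stored as int32 "cents" and
# converted to dollars (double) when the array is pulled into R.
CentsType <- R6::R6Class(
  "CentsType",
  inherit = ExtensionType,
  public = list(
    # Called when an array of this type is converted to an R vector
    # (assumed method name, as in released arrow versions).
    as_vector = function(extension_array) {
      if (inherits(extension_array, "ExtensionArray")) {
        as.vector(extension_array$storage()) / 100
      } else {
        # Defer to the default method for anything else (e.g. a ChunkedArray)
        super$as_vector(extension_array)
      }
    }
  )
)

# An instance of the type: storage is int32, the name is illustrative only
cents_type <- new_extension_type(
  storage_type = int32(),
  extension_name = "example.cents",
  type_class = CentsType
)

# Wrap a plain int32 storage array in the extension type
storage <- Array$create(c(150L, 275L), type = int32())
cents_array <- new_extension_array(storage, cents_type)

# The conversion to an R vector goes through CentsType$as_vector()
dollars <- as.vector(cents_array)
```

Because the conversion logic lives on the type rather than on each container class, the same method can be reused wherever arrays of that type are converted, which is the motivation for the Converter-based approach in the first bullet.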
<details>
``` r
# remotes::install_github("apache/arrow#12467")
# remotes::install_github("paleolimbot/geoarrow@arrow-ext-type")
library(arrow, warn.conflicts = FALSE)
library(dplyr, warn.conflicts = FALSE)
library(geoarrow)
places_folder <- system.file("example_dataset/osm_places", package = "geoarrow")
places <- open_dataset(places_folder)
places$schema$geometry$type
#> GeoArrowType
#> point GEOGCS["WGS 84",DATUM["WGS_...
places$schema$geometry$type$crs
#> [1] "GEOGCS[\"WGS 84\",DATUM[\"WGS_1984\",SPHEROID[\"WGS 84\",6378137,298.257223563],AUTHORITY[\"EPSG\",\"6326\"]],PRIMEM[\"Greenwich\",0,AUTHORITY[\"EPSG\",\"8901\"]],UNIT[\"degree\",0.0174532925199433,AUTHORITY[\"EPSG\",\"9122\"]],AXIS[\"Longitude\",EAST],AXIS[\"Latitude\",NORTH]]"
# works!
Scanner$create(places)$ToTable()
#> Table
#> 7255 rows x 6 columns
#> $osm_id <string>
#> $code <int32>
#> $population <double>
#> $name <string>
#> $geometry <point GEOGCS["WGS 84",DATUM["WGS_...>
#> $fclass <string>
#>
#> See $metadata for additional Schema metadata
# works!
as.data.frame(Scanner$create(places)$ToTable())
#> # A tibble: 7,255 × 6
#> osm_id code population name geometry fclass
#> <chr> <int> <dbl> <chr> <wk_wkb> <chr>
#> 1 21040334 1001 50781 Roskilde <POINT (12.08192 55.64335)> city
#> 2 21040360 1001 72398 Esbjerg <POINT (8.452075 55.46649)> city
#> 3 26559154 1001 62687 Randers <POINT (10.03715 56.46175)> city
#> 4 26559170 1001 60508 Kolding <POINT (9.47905 55.4895)> city
#> 5 26559198 1001 56567 Vejle <POINT (9.533324 55.70001)> city
#> 6 26559213 1001 273077 Aarhus <POINT (10.2134 56.14963)> city
#> 7 26559274 1001 178210 Odense <POINT (10.38521 55.39972)> city
#> 8 1368129781 1001 58646 Horsens <POINT (9.844477 55.86117)> city
#> 9 2247730880 1001 114194 Aalborg <POINT (9.921526 57.04626)> city
#> 10 393558713 1030 0 Englebjerggård <POINT (11.77737 55.2004)> farm
#> # … with 7,245 more rows
# unfortunately, this fails...
places %>%
  filter(population > 100000) %>%
  select(name, population, fclass, geometry) %>%
  arrange(desc(population)) %>%
  collect()
#> Error in `handle_csv_read_error()` at r/R/dplyr-collect.R:33:6:
#> ! NotImplemented: concatenation of extension<geoarrow.point>
#> /Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/arrow/array/concatenate.cc:195 VisitTypeInline(*out_->type, this)
#> /Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/arrow/array/concatenate.cc:590 ConcatenateImpl(data, pool).Concatenate(&out_data)
#> /Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/arrow/compute/kernels/vector_selection.cc:2025 Concatenate(values.chunks(), ctx->memory_pool())
#> /Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/arrow/compute/kernels/vector_selection.cc:2084 TakeCA(*table.column(j), indices, options, ctx)
#> /Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/arrow/compute/exec/sink_node.cc:375 impl_->DoFinish()
#> /Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/arrow/compute/exec/exec_plan.cc:484 iterator_.Next()
#> /Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/arrow/record_batch.cc:337 ReadNext(&batch)
#> /Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/arrow/record_batch.cc:351 ToRecordBatches()
# ...unless we unregister the extension type and use geoarrow_collect()
arrow::unregister_extension_type("geoarrow.point")
open_dataset(places_folder) %>%
  filter(population > 100000) %>%
  select(name, population, fclass, geometry) %>%
  arrange(desc(population)) %>%
  geoarrow_collect()
#> # A tibble: 5 × 4
#> name population fclass geometry
#> <chr> <dbl> <chr> <wk_wkb>
#> 1 København 613288 national_capital <POINT (12.57007 55.68672)>
#> 2 Aarhus 273077 city <POINT (10.2134 56.14963)>
#> 3 Odense 178210 city <POINT (10.38521 55.39972)>
#> 4 Aalborg 114194 city <POINT (9.921526 57.04626)>
#> 5 Frederiksberg 102029 suburb <POINT (12.53262 55.67802)>
```
<sup>Created on 2022-03-29 by the [reprex package](https://reprex.tidyverse.org) (v2.0.1)</sup>
</details>