Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/03/29 14:32:33 UTC

[GitHub] [arrow] paleolimbot edited a comment on pull request #12467: ARROW-15471: [R] ExtensionType support in R

paleolimbot edited a comment on pull request #12467:
URL: https://github.com/apache/arrow/pull/12467#issuecomment-1081896525


   A few more modifications:
   
   - I moved the conversion of extension types to R objects into the Converter API, as Romain suggested, and removed the modifications to ChunkedArray, Table, and RecordBatch. This feels a lot better and is more likely to "just work" in more places.
   - I implemented the geoarrow side of this to make sure it works. It does, except for one spot in the compute kernels where Concatenate doesn't support extension types! See the details below and https://github.com/paleolimbot/geoarrow/pull/7
   - I did play with reversing the order of instantiation of the C++ and R6 objects, but I think that change is a big one with respect to how all objects get passed around in our current implementation. I did, however, rename the methods to match the C++ method names, and it feels a lot better now.
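   
   To make the first point concrete, here is a minimal sketch of what "conversion in the Converter API" looks like from the R side. It assumes the interface as it appears in released versions of the arrow R package (`new_extension_type()`, `new_extension_array()`, and an `as_vector()` method on the `ExtensionType` R6 class); the method names in this PR at the time of writing may differ. `DeciCelsiusType` and the name `"example.decicelsius"` are made up for illustration.
   
   ```r
   library(arrow, warn.conflicts = FALSE)
   
   # Hypothetical extension type: int32 storage holding temperatures in
   # tenths of a degree Celsius.
   DeciCelsiusType <- R6::R6Class(
     "DeciCelsiusType",
     inherit = ExtensionType,
     public = list(
       # Called whenever an array of this type is converted to an R vector.
       as_vector = function(extension_array) {
         if (inherits(extension_array, "ExtensionArray")) {
           # Convert the storage to an R integer vector, then rescale.
           as.vector(extension_array$storage()) / 10
         } else {
           # Let the parent class handle other inputs (e.g., a ChunkedArray).
           super$as_vector(extension_array)
         }
       }
     )
   )
   
   deci_type <- new_extension_type(
     storage_type = int32(),
     extension_name = "example.decicelsius",
     type_class = DeciCelsiusType
   )
   
   # Wrap plain int32 storage in the extension type.
   arr <- new_extension_array(c(215L, 198L, 230L), deci_type)
   
   # Both conversions end up in DeciCelsiusType$as_vector().
   converted <- as.vector(arr)
   tbl <- arrow_table(temp = arr)
   converted_column <- as.data.frame(tbl)$temp
   ```
   
   Because the conversion lives on the type rather than on each container, the same method is used for an Array and for a column of a Table, which is the "just works in more places" behaviour described above.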
   
   <details>
   
   ``` r
   # remotes::install_github("apache/arrow#12467")
   # remotes::install_github("paleolimbot/geoarrow@arrow-ext-type")
   library(arrow, warn.conflicts = FALSE)
   library(dplyr, warn.conflicts = FALSE)
   library(geoarrow)
   
   places_folder <- system.file("example_dataset/osm_places", package = "geoarrow")
   places <- open_dataset(places_folder)
   places$schema$geometry$type
   #> GeoArrowType
   #> point GEOGCS["WGS 84",DATUM["WGS_...
   places$schema$geometry$type$crs
   #> [1] "GEOGCS[\"WGS 84\",DATUM[\"WGS_1984\",SPHEROID[\"WGS 84\",6378137,298.257223563],AUTHORITY[\"EPSG\",\"6326\"]],PRIMEM[\"Greenwich\",0,AUTHORITY[\"EPSG\",\"8901\"]],UNIT[\"degree\",0.0174532925199433,AUTHORITY[\"EPSG\",\"9122\"]],AXIS[\"Longitude\",EAST],AXIS[\"Latitude\",NORTH]]"
   
   # works!
   Scanner$create(places)$ToTable()
   #> Table
   #> 7255 rows x 6 columns
   #> $osm_id <string>
   #> $code <int32>
   #> $population <double>
   #> $name <string>
   #> $geometry <point GEOGCS["WGS 84",DATUM["WGS_...>
   #> $fclass <string>
   #> 
   #> See $metadata for additional Schema metadata
   
   # works!
   as.data.frame(Scanner$create(places)$ToTable())
   #> # A tibble: 7,255 × 6
   #>    osm_id      code population name           geometry                    fclass
   #>    <chr>      <int>      <dbl> <chr>          <wk_wkb>                    <chr> 
   #>  1 21040334    1001      50781 Roskilde       <POINT (12.08192 55.64335)> city  
   #>  2 21040360    1001      72398 Esbjerg        <POINT (8.452075 55.46649)> city  
   #>  3 26559154    1001      62687 Randers        <POINT (10.03715 56.46175)> city  
   #>  4 26559170    1001      60508 Kolding        <POINT (9.47905 55.4895)>   city  
   #>  5 26559198    1001      56567 Vejle          <POINT (9.533324 55.70001)> city  
   #>  6 26559213    1001     273077 Aarhus         <POINT (10.2134 56.14963)>  city  
   #>  7 26559274    1001     178210 Odense         <POINT (10.38521 55.39972)> city  
   #>  8 1368129781  1001      58646 Horsens        <POINT (9.844477 55.86117)> city  
   #>  9 2247730880  1001     114194 Aalborg        <POINT (9.921526 57.04626)> city  
   #> 10 393558713   1030          0 Englebjerggård <POINT (11.77737 55.2004)>  farm  
   #> # … with 7,245 more rows
   
   # unfortunately, this fails...
   places %>% 
     filter(population > 100000) %>% 
     select(name, population, fclass, geometry) %>% 
     arrange(desc(population)) %>% 
     collect()
   #> Error in `handle_csv_read_error()` at r/R/dplyr-collect.R:33:6:
   #> ! NotImplemented: concatenation of extension<geoarrow.point>
   #> /Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/arrow/array/concatenate.cc:195  VisitTypeInline(*out_->type, this)
   #> /Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/arrow/array/concatenate.cc:590  ConcatenateImpl(data, pool).Concatenate(&out_data)
   #> /Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/arrow/compute/kernels/vector_selection.cc:2025  Concatenate(values.chunks(), ctx->memory_pool())
   #> /Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/arrow/compute/kernels/vector_selection.cc:2084  TakeCA(*table.column(j), indices, options, ctx)
   #> /Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/arrow/compute/exec/sink_node.cc:375  impl_->DoFinish()
   #> /Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/arrow/compute/exec/exec_plan.cc:484  iterator_.Next()
   #> /Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/arrow/record_batch.cc:337  ReadNext(&batch)
   #> /Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/arrow/record_batch.cc:351  ToRecordBatches()
   
   # ...unless we unregister the extension type and use geoarrow_collect()
   arrow::unregister_extension_type("geoarrow.point")
   open_dataset(places_folder) %>% 
     filter(population > 100000) %>% 
     select(name, population, fclass, geometry) %>% 
     arrange(desc(population)) %>% 
     geoarrow_collect()
   #> # A tibble: 5 × 4
   #>   name          population fclass           geometry                   
   #>   <chr>              <dbl> <chr>            <wk_wkb>                   
   #> 1 København         613288 national_capital <POINT (12.57007 55.68672)>
   #> 2 Aarhus            273077 city             <POINT (10.2134 56.14963)> 
   #> 3 Odense            178210 city             <POINT (10.38521 55.39972)>
   #> 4 Aalborg           114194 city             <POINT (9.921526 57.04626)>
   #> 5 Frederiksberg     102029 suburb           <POINT (12.53262 55.67802)>
   ```
   
   <sup>Created on 2022-03-29 by the [reprex package](https://reprex.tidyverse.org) (v2.0.1)</sup>
   
   </details>
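   
   The workaround in the reprex amounts to letting the query operate on the storage type and re-applying the extension semantics afterwards. A rough array-level sketch of the same idea follows, assuming `concat_arrays()`, `new_extension_array()`, and `ExtensionArray$storage()` as found in released versions of the arrow R package; the extension name `"example.id"` is hypothetical and this is not how geoarrow_collect() is implemented.
   
   ```r
   library(arrow, warn.conflicts = FALSE)
   
   # Hypothetical extension type backed by int32 storage.
   id_type <- new_extension_type(int32(), "example.id")
   
   chunk1 <- new_extension_array(1:3, id_type)
   chunk2 <- new_extension_array(4:6, id_type)
   
   # Concatenate the plain storage arrays, where concatenation is
   # implemented, instead of the extension arrays themselves...
   storage <- concat_arrays(chunk1$storage(), chunk2$storage())
   
   # ...then wrap the result in the extension type again.
   combined <- new_extension_array(storage, id_type)
   ```
   
   This only illustrates why dropping down to the storage type sidesteps the NotImplemented error; a proper fix would belong in Concatenate itself.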

