You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2020/10/29 11:24:21 UTC
[GitHub] [arrow] romainfrancois opened a new pull request #8549: ARROW-10386 [R]: List column class attributes not preserved in roundtrip
romainfrancois opened a new pull request #8549:
URL: https://github.com/apache/arrow/pull/8549
On the original example:
``` r
library(arrow)
#>
#> Attaching package: 'arrow'
#> The following object is masked from 'package:utils':
#>
#> timestamp
library(sf)
#> Warning: package 'sf' was built under R version 4.0.2
#> Linking to GEOS 3.8.1, GDAL 3.1.1, PROJ 6.3.1
fname <- system.file("shape/nc.shp", package="sf")
df_spatial <- st_read(fname)
#> Reading layer `nc' from data source `/Users/romainfrancois/.R/library/4.0/sf/shape/nc.shp' using driver `ESRI Shapefile'
#> Simple feature collection with 100 features and 14 fields
#> geometry type: MULTIPOLYGON
#> dimension: XY
#> bbox: xmin: -84.32385 ymin: 33.88199 xmax: -75.45698 ymax: 36.58965
#> geographic CRS: NAD27
df_spatial
#> Simple feature collection with 100 features and 14 fields
#> geometry type: MULTIPOLYGON
#> dimension: XY
#> bbox: xmin: -84.32385 ymin: 33.88199 xmax: -75.45698 ymax: 36.58965
#> geographic CRS: NAD27
#> First 10 features:
#> AREA PERIMETER CNTY_ CNTY_ID NAME FIPS FIPSNO CRESS_ID BIR74 SID74
#> 1 0.114 1.442 1825 1825 Ashe 37009 37009 5 1091 1
#> 2 0.061 1.231 1827 1827 Alleghany 37005 37005 3 487 0
#> 3 0.143 1.630 1828 1828 Surry 37171 37171 86 3188 5
#> 4 0.070 2.968 1831 1831 Currituck 37053 37053 27 508 1
#> 5 0.153 2.206 1832 1832 Northampton 37131 37131 66 1421 9
#> 6 0.097 1.670 1833 1833 Hertford 37091 37091 46 1452 7
#> 7 0.062 1.547 1834 1834 Camden 37029 37029 15 286 0
#> 8 0.091 1.284 1835 1835 Gates 37073 37073 37 420 0
#> 9 0.118 1.421 1836 1836 Warren 37185 37185 93 968 4
#> 10 0.124 1.428 1837 1837 Stokes 37169 37169 85 1612 1
#> NWBIR74 BIR79 SID79 NWBIR79 geometry
#> 1 10 1364 0 19 MULTIPOLYGON (((-81.47276 3...
#> 2 10 542 3 12 MULTIPOLYGON (((-81.23989 3...
#> 3 208 3616 6 260 MULTIPOLYGON (((-80.45634 3...
#> 4 123 830 2 145 MULTIPOLYGON (((-76.00897 3...
#> 5 1066 1606 3 1197 MULTIPOLYGON (((-77.21767 3...
#> 6 954 1838 5 1237 MULTIPOLYGON (((-76.74506 3...
#> 7 115 350 2 139 MULTIPOLYGON (((-76.00897 3...
#> 8 254 594 2 371 MULTIPOLYGON (((-76.56251 3...
#> 9 748 1190 2 844 MULTIPOLYGON (((-78.30876 3...
#> 10 160 2038 5 176 MULTIPOLYGON (((-80.02567 3...
tab <- Table$create(df_spatial)
roundtripped <- as.data.frame(tab)
roundtripped
#> Simple feature collection with 100 features and 14 fields
#> geometry type: MULTIPOLYGON
#> dimension: XY
#> bbox: xmin: -84.32385 ymin: 33.88199 xmax: -75.45698 ymax: 36.58965
#> geographic CRS: NAD27
#> First 10 features:
#> AREA PERIMETER CNTY_ CNTY_ID NAME FIPS FIPSNO CRESS_ID BIR74 SID74
#> 1 0.114 1.442 1825 1825 Ashe 37009 37009 5 1091 1
#> 2 0.061 1.231 1827 1827 Alleghany 37005 37005 3 487 0
#> 3 0.143 1.630 1828 1828 Surry 37171 37171 86 3188 5
#> 4 0.070 2.968 1831 1831 Currituck 37053 37053 27 508 1
#> 5 0.153 2.206 1832 1832 Northampton 37131 37131 66 1421 9
#> 6 0.097 1.670 1833 1833 Hertford 37091 37091 46 1452 7
#> 7 0.062 1.547 1834 1834 Camden 37029 37029 15 286 0
#> 8 0.091 1.284 1835 1835 Gates 37073 37073 37 420 0
#> 9 0.118 1.421 1836 1836 Warren 37185 37185 93 968 4
#> 10 0.124 1.428 1837 1837 Stokes 37169 37169 85 1612 1
#> NWBIR74 BIR79 SID79 NWBIR79 geometry
#> 1 10 1364 0 19 MULTIPOLYGON (((-81.47276 3...
#> 2 10 542 3 12 MULTIPOLYGON (((-81.23989 3...
#> 3 208 3616 6 260 MULTIPOLYGON (((-80.45634 3...
#> 4 123 830 2 145 MULTIPOLYGON (((-76.00897 3...
#> 5 1066 1606 3 1197 MULTIPOLYGON (((-77.21767 3...
#> 6 954 1838 5 1237 MULTIPOLYGON (((-76.74506 3...
#> 7 115 350 2 139 MULTIPOLYGON (((-76.00897 3...
#> 8 254 594 2 371 MULTIPOLYGON (((-76.56251 3...
#> 9 748 1190 2 844 MULTIPOLYGON (((-78.30876 3...
#> 10 160 2038 5 176 MULTIPOLYGON (((-80.02567 3...
attributes(roundtripped$geometry[[1]])
#> $class
#> [1] "XY" "MULTIPOLYGON" "sfg"
attributes(df_spatial$geometry[[1]])
#> $class
#> [1] "XY" "MULTIPOLYGON" "sfg"
```
<sup>Created on 2020-10-29 by the [reprex package](https://reprex.tidyverse.org) (v0.3.0.9001)</sup>
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] nealrichardson commented on pull request #8549: ARROW-10386 [R]: List column class attributes not preserved in roundtrip
Posted by GitBox <gi...@apache.org>.
nealrichardson commented on pull request #8549:
URL: https://github.com/apache/arrow/pull/8549#issuecomment-759688988
Closing in favor of #9182; relevant commits cherry-picked to there.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] romainfrancois commented on pull request #8549: ARROW-10386 [R]: List column class attributes not preserved in roundtrip
Posted by GitBox <gi...@apache.org>.
romainfrancois commented on pull request #8549:
URL: https://github.com/apache/arrow/pull/8549#issuecomment-718689459
This increases again the weight of the metadata, which now has to include attributes for each element of a list column, aka a struct array, but this seems on this example to work.
Can't use `sf` on the tests, so using something simpler:
```r
test_that("metadata of list elements (ARROW-10386)", {
df <- data.frame(x = I(list(structure(1, foo = "bar"), structure(2, foo = "bar"))))
tab <- Table$create(df)
expect_identical(attr(as.data.frame(tab)$x[[1]], "foo"), "bar")
expect_identical(attr(as.data.frame(tab)$x[[2]], "foo"), "bar")
})
```
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] github-actions[bot] commented on pull request #8549: ARROW-10386 [R]: List column class attributes not preserved in roundtrip
Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #8549:
URL: https://github.com/apache/arrow/pull/8549#issuecomment-718697703
https://issues.apache.org/jira/browse/ARROW-10386
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] nealrichardson closed pull request #8549: ARROW-10386 [R]: List column class attributes not preserved in roundtrip
Posted by GitBox <gi...@apache.org>.
nealrichardson closed pull request #8549:
URL: https://github.com/apache/arrow/pull/8549
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] nealrichardson commented on pull request #8549: ARROW-10386 [R]: List column class attributes not preserved in roundtrip
Posted by GitBox <gi...@apache.org>.
nealrichardson commented on pull request #8549:
URL: https://github.com/apache/arrow/pull/8549#issuecomment-754251796
> This increases again the weight of the metadata, which now has to include attributes for each element of a list column
How much does it blow up the metadata? Is this going to scale to be able to handle normal/large shapefiles?
Is there a more efficient representation for this metadata (considering that we have to serialize it to a string)? Should we special-case `sf` objects?
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] jonkeane commented on a change in pull request #8549: ARROW-10386 [R]: List column class attributes not preserved in roundtrip
Posted by GitBox <gi...@apache.org>.
jonkeane commented on a change in pull request #8549:
URL: https://github.com/apache/arrow/pull/8549#discussion_r551552588
##########
File path: r/R/record-batch.R
##########
@@ -286,6 +286,20 @@ as.data.frame.RecordBatch <- function(x, row.names = NULL, optional = FALSE, ...
apply_arrow_r_metadata <- function(x, r_metadata) {
tryCatch({
+ columns_metadata <- r_metadata$columns
+ if (is.data.frame(x)) {
+ if (length(names(x)) && !is.null(columns_metadata)) {
+ for (name in intersect(names(columns_metadata), names(x))) {
+ x[[name]] <- apply_arrow_r_metadata(x[[name]], columns_metadata[[name]])
+ }
+ }
+ } else if(is.list(x) && !inherits(x, "POSIXlt") && !is.null(columns_metadata)) {
+ x <- map2(x, columns_metadata, function(.x, .y) {
+ apply_arrow_r_metadata(.x, .y)
+ })
+ x
+ }
Review comment:
Could we move `!is.null(columns_metadata)` up to line 290 instead of having it on both 291 and 296?
##########
File path: r/tests/testthat/test-metadata.R
##########
@@ -134,3 +134,10 @@ test_that("metadata keeps attribute of top level data frame", {
expect_identical(attr(as.data.frame(tab), "foo"), "bar")
expect_identical(as.data.frame(tab), df)
})
+
+test_that("metadata of list elements (ARROW-10386)", {
+ df <- data.frame(x = I(list(structure(1, foo = "bar"), structure(2, foo = "bar"))))
+ tab <- Table$create(df)
+ expect_identical(attr(as.data.frame(tab)$x[[1]], "foo"), "bar")
+ expect_identical(attr(as.data.frame(tab)$x[[2]], "foo"), "bar")
Review comment:
I wonder if it would be clearer to have different attributes on each item/row to make it super obvious that we're not picking the attributes of the first item/row and copying them for the whole column.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org