You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2020/10/29 11:24:21 UTC

[GitHub] [arrow] romainfrancois opened a new pull request #8549: ARROW-10386 [R]: List column class attributes not preserved in roundtrip

romainfrancois opened a new pull request #8549:
URL: https://github.com/apache/arrow/pull/8549


   On the original example: 
   
   ``` r
   library(arrow)
   #> 
   #> Attaching package: 'arrow'
   #> The following object is masked from 'package:utils':
   #> 
   #>     timestamp
   library(sf)
   #> Warning: package 'sf' was built under R version 4.0.2
   #> Linking to GEOS 3.8.1, GDAL 3.1.1, PROJ 6.3.1
   
   fname <- system.file("shape/nc.shp", package="sf")
   df_spatial <- st_read(fname)
   #> Reading layer `nc' from data source `/Users/romainfrancois/.R/library/4.0/sf/shape/nc.shp' using driver `ESRI Shapefile'
   #> Simple feature collection with 100 features and 14 fields
   #> geometry type:  MULTIPOLYGON
   #> dimension:      XY
   #> bbox:           xmin: -84.32385 ymin: 33.88199 xmax: -75.45698 ymax: 36.58965
   #> geographic CRS: NAD27
   df_spatial
   #> Simple feature collection with 100 features and 14 fields
   #> geometry type:  MULTIPOLYGON
   #> dimension:      XY
   #> bbox:           xmin: -84.32385 ymin: 33.88199 xmax: -75.45698 ymax: 36.58965
   #> geographic CRS: NAD27
   #> First 10 features:
   #>     AREA PERIMETER CNTY_ CNTY_ID        NAME  FIPS FIPSNO CRESS_ID BIR74 SID74
   #> 1  0.114     1.442  1825    1825        Ashe 37009  37009        5  1091     1
   #> 2  0.061     1.231  1827    1827   Alleghany 37005  37005        3   487     0
   #> 3  0.143     1.630  1828    1828       Surry 37171  37171       86  3188     5
   #> 4  0.070     2.968  1831    1831   Currituck 37053  37053       27   508     1
   #> 5  0.153     2.206  1832    1832 Northampton 37131  37131       66  1421     9
   #> 6  0.097     1.670  1833    1833    Hertford 37091  37091       46  1452     7
   #> 7  0.062     1.547  1834    1834      Camden 37029  37029       15   286     0
   #> 8  0.091     1.284  1835    1835       Gates 37073  37073       37   420     0
   #> 9  0.118     1.421  1836    1836      Warren 37185  37185       93   968     4
   #> 10 0.124     1.428  1837    1837      Stokes 37169  37169       85  1612     1
   #>    NWBIR74 BIR79 SID79 NWBIR79                       geometry
   #> 1       10  1364     0      19 MULTIPOLYGON (((-81.47276 3...
   #> 2       10   542     3      12 MULTIPOLYGON (((-81.23989 3...
   #> 3      208  3616     6     260 MULTIPOLYGON (((-80.45634 3...
   #> 4      123   830     2     145 MULTIPOLYGON (((-76.00897 3...
   #> 5     1066  1606     3    1197 MULTIPOLYGON (((-77.21767 3...
   #> 6      954  1838     5    1237 MULTIPOLYGON (((-76.74506 3...
   #> 7      115   350     2     139 MULTIPOLYGON (((-76.00897 3...
   #> 8      254   594     2     371 MULTIPOLYGON (((-76.56251 3...
   #> 9      748  1190     2     844 MULTIPOLYGON (((-78.30876 3...
   #> 10     160  2038     5     176 MULTIPOLYGON (((-80.02567 3...
   
   tab <- Table$create(df_spatial)
   roundtripped <- as.data.frame(tab)
   roundtripped
   #> Simple feature collection with 100 features and 14 fields
   #> geometry type:  MULTIPOLYGON
   #> dimension:      XY
   #> bbox:           xmin: -84.32385 ymin: 33.88199 xmax: -75.45698 ymax: 36.58965
   #> geographic CRS: NAD27
   #> First 10 features:
   #>     AREA PERIMETER CNTY_ CNTY_ID        NAME  FIPS FIPSNO CRESS_ID BIR74 SID74
   #> 1  0.114     1.442  1825    1825        Ashe 37009  37009        5  1091     1
   #> 2  0.061     1.231  1827    1827   Alleghany 37005  37005        3   487     0
   #> 3  0.143     1.630  1828    1828       Surry 37171  37171       86  3188     5
   #> 4  0.070     2.968  1831    1831   Currituck 37053  37053       27   508     1
   #> 5  0.153     2.206  1832    1832 Northampton 37131  37131       66  1421     9
   #> 6  0.097     1.670  1833    1833    Hertford 37091  37091       46  1452     7
   #> 7  0.062     1.547  1834    1834      Camden 37029  37029       15   286     0
   #> 8  0.091     1.284  1835    1835       Gates 37073  37073       37   420     0
   #> 9  0.118     1.421  1836    1836      Warren 37185  37185       93   968     4
   #> 10 0.124     1.428  1837    1837      Stokes 37169  37169       85  1612     1
   #>    NWBIR74 BIR79 SID79 NWBIR79                       geometry
   #> 1       10  1364     0      19 MULTIPOLYGON (((-81.47276 3...
   #> 2       10   542     3      12 MULTIPOLYGON (((-81.23989 3...
   #> 3      208  3616     6     260 MULTIPOLYGON (((-80.45634 3...
   #> 4      123   830     2     145 MULTIPOLYGON (((-76.00897 3...
   #> 5     1066  1606     3    1197 MULTIPOLYGON (((-77.21767 3...
   #> 6      954  1838     5    1237 MULTIPOLYGON (((-76.74506 3...
   #> 7      115   350     2     139 MULTIPOLYGON (((-76.00897 3...
   #> 8      254   594     2     371 MULTIPOLYGON (((-76.56251 3...
   #> 9      748  1190     2     844 MULTIPOLYGON (((-78.30876 3...
   #> 10     160  2038     5     176 MULTIPOLYGON (((-80.02567 3...
   
   attributes(roundtripped$geometry[[1]])
   #> $class
   #> [1] "XY"           "MULTIPOLYGON" "sfg"
   attributes(df_spatial$geometry[[1]])
   #> $class
   #> [1] "XY"           "MULTIPOLYGON" "sfg"
   ```
   
   <sup>Created on 2020-10-29 by the [reprex package](https://reprex.tidyverse.org) (v0.3.0.9001)</sup>


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] nealrichardson commented on pull request #8549: ARROW-10386 [R]: List column class attributes not preserved in roundtrip

Posted by GitBox <gi...@apache.org>.
nealrichardson commented on pull request #8549:
URL: https://github.com/apache/arrow/pull/8549#issuecomment-759688988


   Closing in favor of #9182; relevant commits cherry-picked to there.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] romainfrancois commented on pull request #8549: ARROW-10386 [R]: List column class attributes not preserved in roundtrip

Posted by GitBox <gi...@apache.org>.
romainfrancois commented on pull request #8549:
URL: https://github.com/apache/arrow/pull/8549#issuecomment-718689459


   This increases again the weight of the metadata, which now has to include attributes for each element of a list column, aka a struct array, but this seems on this example to work. 
   
   Can't use `sf` on the tests, so using something simpler: 
   
   ```r
   test_that("metadata of list elements (ARROW-10386)", {
     df <- data.frame(x = I(list(structure(1, foo = "bar"), structure(2, foo = "bar"))))
     tab <- Table$create(df)
     expect_identical(attr(as.data.frame(tab)$x[[1]], "foo"), "bar")
     expect_identical(attr(as.data.frame(tab)$x[[2]], "foo"), "bar")
   })
   ```


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] github-actions[bot] commented on pull request #8549: ARROW-10386 [R]: List column class attributes not preserved in roundtrip

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #8549:
URL: https://github.com/apache/arrow/pull/8549#issuecomment-718697703


   https://issues.apache.org/jira/browse/ARROW-10386


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] nealrichardson closed pull request #8549: ARROW-10386 [R]: List column class attributes not preserved in roundtrip

Posted by GitBox <gi...@apache.org>.
nealrichardson closed pull request #8549:
URL: https://github.com/apache/arrow/pull/8549


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] nealrichardson commented on pull request #8549: ARROW-10386 [R]: List column class attributes not preserved in roundtrip

Posted by GitBox <gi...@apache.org>.
nealrichardson commented on pull request #8549:
URL: https://github.com/apache/arrow/pull/8549#issuecomment-754251796


   > This increases again the weight of the metadata, which now has to include attributes for each element of a list column
   
   How much does it blow up the metadata? Is this going to scale to be able to handle normal/large shapefiles? 
   
   Is there a more efficient representation for this metadata (considering that we have to serialize it to a string)? Should we special-case `sf` objects?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jonkeane commented on a change in pull request #8549: ARROW-10386 [R]: List column class attributes not preserved in roundtrip

Posted by GitBox <gi...@apache.org>.
jonkeane commented on a change in pull request #8549:
URL: https://github.com/apache/arrow/pull/8549#discussion_r551552588



##########
File path: r/R/record-batch.R
##########
@@ -286,6 +286,20 @@ as.data.frame.RecordBatch <- function(x, row.names = NULL, optional = FALSE, ...
 
 apply_arrow_r_metadata <- function(x, r_metadata) {
   tryCatch({
+    columns_metadata <- r_metadata$columns
+    if (is.data.frame(x)) {
+      if (length(names(x)) && !is.null(columns_metadata)) {
+        for (name in intersect(names(columns_metadata), names(x))) {
+          x[[name]] <- apply_arrow_r_metadata(x[[name]], columns_metadata[[name]])
+        }
+      }
+    } else if(is.list(x) && !inherits(x, "POSIXlt") && !is.null(columns_metadata)) {
+      x <- map2(x, columns_metadata, function(.x, .y) {
+        apply_arrow_r_metadata(.x, .y)
+      })
+      x
+    }

Review comment:
       Could we move `!is.null(columns_metadata)` up to line 290 instead of having it on both 291 and 296?

##########
File path: r/tests/testthat/test-metadata.R
##########
@@ -134,3 +134,10 @@ test_that("metadata keeps attribute of top level data frame", {
   expect_identical(attr(as.data.frame(tab), "foo"), "bar")
   expect_identical(as.data.frame(tab), df)
 })
+
+test_that("metadata of list elements (ARROW-10386)", {
+  df <- data.frame(x = I(list(structure(1, foo = "bar"), structure(2, foo = "bar"))))
+  tab <- Table$create(df)
+  expect_identical(attr(as.data.frame(tab)$x[[1]], "foo"), "bar")
+  expect_identical(attr(as.data.frame(tab)$x[[2]], "foo"), "bar")

Review comment:
       I wonder if it would be clearer to have different attributes on each item/row to make it super obvious that we're not picking the attributes of the first item/row and copying them for the whole column.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org