You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/07/01 22:45:47 UTC

[GitHub] [arrow] jonkeane opened a new pull request #10646: ARROW-13189: [R] Disable row-level metadata application on datasets

jonkeane opened a new pull request #10646:
URL: https://github.com/apache/arrow/pull/10646


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] nealrichardson commented on a change in pull request #10646: ARROW-13189: [R] Disable row-level metadata application on datasets

Posted by GitBox <gi...@apache.org>.
nealrichardson commented on a change in pull request #10646:
URL: https://github.com/apache/arrow/pull/10646#discussion_r670424361



##########
File path: r/R/metadata.R
##########
@@ -60,9 +61,23 @@ apply_arrow_r_metadata <- function(x, r_metadata) {
         }
       }
     } else if (is.list(x) && !inherits(x, "POSIXlt") && !is.null(columns_metadata)) {
-      x <- map2(x, columns_metadata, function(.x, .y) {
-        apply_arrow_r_metadata(.x, .y)
-      })
+      # If we have a list and "columns_metadata" this applies row-level metadata
+      # inside of a column in a dataframe.
+
+      # However, if we are inside of a dataset collection, we cannot apply this
+      # row-level metadata, since the order of the rows is not gauranteed to be
+      # the same, so don't even try, but warn what's going on
+      stack <- call_stack()
+      in_dataset_collect <- any(map_lgl(stack, function(x) {
+        !is.null(x$fn_name) && x$fn_name == "collect.arrow_dplyr_query"
+      }))
+      if (in_dataset_collect) {
+        warning("Row-level metadata has been discarded")

Review comment:
       Does this message deserve more explanation? Trying to think of what I would do if I saw that message.
   
   Also: 
   
   ```suggestion
           warning("Row-level metadata has been discarded", call. = FALSE)
   ```

##########
File path: r/tests/testthat/test-metadata.R
##########
@@ -205,3 +205,27 @@ test_that("metadata of list elements (ARROW-10386)", {
   expect_identical(attr(as.data.frame(tab)$x[[1]], "foo"), "bar")
   expect_identical(attr(as.data.frame(tab)$x[[2]], "baz"), "qux")
 })
+
+
+test_that("metadata of list elements (ARROW-10386)", {
+  skip_if_not_available("dataset")
+  skip_if_not_available("parquet")
+  df <- tibble::tibble(
+    metadata = list(
+      structure(1, my_value_as_attr = 1),
+      structure(2, my_value_as_attr = 2),
+      structure(3, my_value_as_attr = 3),
+      structure(4, my_value_as_attr = 3)),
+    part = c(1, 3, 2, 1)
+  )
+
+  dst_dir <- make_temp_dir()
+  write_dataset(df, dst_dir, partitioning = "part")
+  ds <- open_dataset(dst_dir)
+  expect_warning(
+    df_from_ds <- dplyr::collect(ds),

Review comment:
       What happens if I `ds %>% select(part) %>% collect()`, do I get the warning? I'd think I shouldn't.

##########
File path: r/R/metadata.R
##########
@@ -60,9 +61,23 @@ apply_arrow_r_metadata <- function(x, r_metadata) {
         }
       }
     } else if (is.list(x) && !inherits(x, "POSIXlt") && !is.null(columns_metadata)) {
-      x <- map2(x, columns_metadata, function(.x, .y) {
-        apply_arrow_r_metadata(.x, .y)
-      })
+      # If we have a list and "columns_metadata" this applies row-level metadata
+      # inside of a column in a dataframe.
+
+      # However, if we are inside of a dataset collection, we cannot apply this
+      # row-level metadata, since the order of the rows is not gauranteed to be

Review comment:
       ```suggestion
         # row-level metadata, since the order of the rows is not guaranteed to be
   ```




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jonkeane commented on a change in pull request #10646: ARROW-13189: [R] Disable row-level metadata application on datasets

Posted by GitBox <gi...@apache.org>.
jonkeane commented on a change in pull request #10646:
URL: https://github.com/apache/arrow/pull/10646#discussion_r670703153



##########
File path: r/tests/testthat/test-metadata.R
##########
@@ -205,3 +205,27 @@ test_that("metadata of list elements (ARROW-10386)", {
   expect_identical(attr(as.data.frame(tab)$x[[1]], "foo"), "bar")
   expect_identical(attr(as.data.frame(tab)$x[[2]], "baz"), "qux")
 })
+
+
+test_that("metadata of list elements (ARROW-10386)", {
+  skip_if_not_available("dataset")
+  skip_if_not_available("parquet")
+  df <- tibble::tibble(
+    metadata = list(
+      structure(1, my_value_as_attr = 1),
+      structure(2, my_value_as_attr = 2),
+      structure(3, my_value_as_attr = 3),
+      structure(4, my_value_as_attr = 3)),
+    part = c(1, 3, 2, 1)
+  )
+
+  dst_dir <- make_temp_dir()
+  write_dataset(df, dst_dir, partitioning = "part")
+  ds <- open_dataset(dst_dir)
+  expect_warning(
+    df_from_ds <- dplyr::collect(ds),

Review comment:
       Yup, that works without warning (like it should as you say):
   
   ```
   > ds %>% select(part) %>% collect()
   # A tibble: 4 x 1
      part
   * <int>
   1     1
   2     1
   3     2
   4     3
   > 
   ```




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] thisisnic commented on a change in pull request #10646: ARROW-13189: [R] Disable row-level metadata application on datasets

Posted by GitBox <gi...@apache.org>.
thisisnic commented on a change in pull request #10646:
URL: https://github.com/apache/arrow/pull/10646#discussion_r668900533



##########
File path: r/R/metadata.R
##########
@@ -60,9 +61,23 @@ apply_arrow_r_metadata <- function(x, r_metadata) {
         }
       }
     } else if (is.list(x) && !inherits(x, "POSIXlt") && !is.null(columns_metadata)) {
-      x <- map2(x, columns_metadata, function(.x, .y) {
-        apply_arrow_r_metadata(.x, .y)
-      })
+      # If we have a list and "columns_metadata" this is apply row-level metadata

Review comment:
       nit: looks like a typo where you say "this is apply"?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] thisisnic commented on a change in pull request #10646: ARROW-13189: [R] Disable row-level metadata application on datasets

Posted by GitBox <gi...@apache.org>.
thisisnic commented on a change in pull request #10646:
URL: https://github.com/apache/arrow/pull/10646#discussion_r668900533



##########
File path: r/R/metadata.R
##########
@@ -60,9 +61,23 @@ apply_arrow_r_metadata <- function(x, r_metadata) {
         }
       }
     } else if (is.list(x) && !inherits(x, "POSIXlt") && !is.null(columns_metadata)) {
-      x <- map2(x, columns_metadata, function(.x, .y) {
-        apply_arrow_r_metadata(.x, .y)
-      })
+      # If we have a list and "columns_metadata" this is apply row-level metadata

Review comment:
       nit: looks like a typo where you say "this is apply"




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] thisisnic commented on pull request #10646: ARROW-13189: [R] Disable row-level metadata application on datasets

Posted by GitBox <gi...@apache.org>.
thisisnic commented on pull request #10646:
URL: https://github.com/apache/arrow/pull/10646#issuecomment-879212830


   Other than minor thing about typo (or phrasing) LGTM


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jonkeane commented on a change in pull request #10646: ARROW-13189: [R] Disable row-level metadata application on datasets

Posted by GitBox <gi...@apache.org>.
jonkeane commented on a change in pull request #10646:
URL: https://github.com/apache/arrow/pull/10646#discussion_r670703573



##########
File path: r/tests/testthat/test-metadata.R
##########
@@ -205,3 +205,27 @@ test_that("metadata of list elements (ARROW-10386)", {
   expect_identical(attr(as.data.frame(tab)$x[[1]], "foo"), "bar")
   expect_identical(attr(as.data.frame(tab)$x[[2]], "baz"), "qux")
 })
+
+
+test_that("metadata of list elements (ARROW-10386)", {
+  skip_if_not_available("dataset")
+  skip_if_not_available("parquet")
+  df <- tibble::tibble(
+    metadata = list(
+      structure(1, my_value_as_attr = 1),
+      structure(2, my_value_as_attr = 2),
+      structure(3, my_value_as_attr = 3),
+      structure(4, my_value_as_attr = 3)),
+    part = c(1, 3, 2, 1)
+  )
+
+  dst_dir <- make_temp_dir()
+  write_dataset(df, dst_dir, partitioning = "part")
+  ds <- open_dataset(dst_dir)
+  expect_warning(
+    df_from_ds <- dplyr::collect(ds),

Review comment:
       I'll add a test that asserts that though, it's a quick + easy test
   




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] github-actions[bot] commented on pull request #10646: ARROW-13189: [R] Disable row-level metadata application on datasets

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #10646:
URL: https://github.com/apache/arrow/pull/10646#issuecomment-872594250


   https://issues.apache.org/jira/browse/ARROW-13189


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jonkeane commented on pull request #10646: ARROW-13189: [R] Disable row-level metadata application on datasets

Posted by GitBox <gi...@apache.org>.
jonkeane commented on pull request #10646:
URL: https://github.com/apache/arrow/pull/10646#issuecomment-880957846


   Ok, I've made a few changes (including avoiding the _now_ deprecated `call_stack()`). I've also added this to prevent _writing_ this metadata with datasets. I've expanded the warning messages a bit more. The messages themselves could be longer, though I think it might be better to have the longer explanation in `NEWS` and we can point to {sfarrow} there for saving sf data.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] nealrichardson closed pull request #10646: ARROW-13189: [R] Disable row-level metadata application on datasets

Posted by GitBox <gi...@apache.org>.
nealrichardson closed pull request #10646:
URL: https://github.com/apache/arrow/pull/10646


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org