Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/04/06 09:38:50 UTC

[GitHub] [arrow] thisisnic commented on a diff in pull request #12799: ARROW-15827: [R] Improve UX of write_dataset(..., max_rows_per_group)

thisisnic commented on code in PR #12799:
URL: https://github.com/apache/arrow/pull/12799#discussion_r843723852


##########
r/R/dataset-write.R:
##########
@@ -153,6 +153,16 @@ write_dataset <- function(dataset,
     }
   }
 
+  if (!missing(max_rows_per_file) && max_rows_per_group > max_rows_per_file) {
+    if (!missing(max_rows_per_group)) {
+      warning(paste0(c(
+        "'max_rows_per_group' must be less or equal to 'max_rows_per_file'.",
+        "\n'max_rows_per_group' set to value of 'max_rows_per_file'."

Review Comment:
   ```suggestion
           "`max_rows_per_group` must be less than or equal to `max_rows_per_file`.",
           "\n`max_rows_per_group` set to value of `max_rows_per_file`."
   ```
   
   Nice, informative warning here. Typically, we wrap parameter names in backticks in these kinds of messages.
   
   How about, though, we adjust the condition that triggers the change on line 163 below, so that it only applies when `max_rows_per_file` has been set (i.e. isn't missing) *and* `max_rows_per_group` has been left at its default (i.e. is "missing")?
   
   I'm hesitant to make too many assumptions about user intentions, but I think we can safely say that, in those circumstances, the user most likely wants to set the maximum rows per file and simply hasn't paid attention to the `max_rows_per_group` parameter, so we can update the value silently, with no warning.
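   
   To make the condition concrete, here is a rough sketch of what I have in mind. It's written as a standalone helper purely for illustration: `resolve_max_rows_per_group()` is a hypothetical name, not something in the package, and the default values in its signature are my assumption about what `write_dataset()` uses. It also assumes that when both arguments are set explicitly we leave them untouched and let the existing validation deal with any conflict.
   
   ```r
   # Hypothetical helper for illustration only; not part of arrow.
   # The defaults below are assumed to mirror those of write_dataset().
   resolve_max_rows_per_group <- function(max_rows_per_file = 0L,
                                          max_rows_per_group = bitwShiftL(1, 20)) {
     # Only step in when the user set max_rows_per_file but left
     # max_rows_per_group at its default.
     if (!missing(max_rows_per_file) &&
         missing(max_rows_per_group) &&
         max_rows_per_group > max_rows_per_file) {
       # Silently cap the group size at the file size; no warning needed.
       max_rows_per_group <- max_rows_per_file
     }
     # If both were supplied explicitly, the value is returned unchanged
     # (assumption: any conflict is reported elsewhere).
     max_rows_per_group
   }
   
   resolve_max_rows_per_group(max_rows_per_file = 5)
   resolve_max_rows_per_group(max_rows_per_file = 5, max_rows_per_group = 10)
   ```
   
   The first call caps the group size at 5 without any message; the second returns 10 unchanged because the user chose both values explicitly.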



##########
r/tests/testthat/test-dataset-write.R:
##########
@@ -506,6 +506,47 @@ test_that("Max partitions fails with non-integer values and less than required p
   )
 })
 
+test_that("max_rows_per_group is adjusted if at odds with max_rows_per_file", {
+  skip_if_not_available("parquet")
+  df <- tibble::tibble(
+    int = 1:10,
+    dbl = as.numeric(1:10),
+    lgl = rep(c(TRUE, FALSE, NA, TRUE, FALSE), 2),
+    chr = letters[1:10],
+  )
+  dst_dir <- make_temp_dir()
+
+  # max_rows_per_group unset => pass
+  expect_silent(
+    write_dataset(df, dst_dir, max_rows_per_file = 5)
+  )
+
+  expect_equal(
+    {
+      write_dataset(df, dst_dir, max_rows_per_file = 5)
+      list.files(dst_dir, "part-") %>%
+        length()
+    },
+    2
+  )

Review Comment:
   Great attention to detail, but I think we can remove this test, as it's a little out of scope for this ticket. It's basically testing that `max_rows_per_file` works as intended (which should be tested in the C++ layer anyway), rather than the specific thing this ticket addresses (adjusting the behaviour when the user specifies a `max_rows_per_file` that is at odds with `max_rows_per_group`).


