Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/04/05 13:12:30 UTC

[GitHub] [arrow] assignUser opened a new pull request, #12799: ARROW-15827: [R] Improve UX of write_dataset(..., max_rows_per_group)

assignUser opened a new pull request, #12799:
URL: https://github.com/apache/arrow/pull/12799

   @thisisnic could I bother you for a review?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] thisisnic commented on a diff in pull request #12799: ARROW-15827: [R] Improve UX of write_dataset(..., max_rows_per_group)

Posted by GitBox <gi...@apache.org>.
thisisnic commented on code in PR #12799:
URL: https://github.com/apache/arrow/pull/12799#discussion_r843723852


##########
r/R/dataset-write.R:
##########
@@ -153,6 +153,16 @@ write_dataset <- function(dataset,
     }
   }
 
+  if (!missing(max_rows_per_file) && max_rows_per_group > max_rows_per_file) {
+    if (!missing(max_rows_per_group)) {
+      warning(paste0(c(
+        "'max_rows_per_group' must be less or equal to 'max_rows_per_file'.",
+        "\n'max_rows_per_group' set to value of 'max_rows_per_file'."

Review Comment:
   ```suggestion
           "`max_rows_per_group` must be less or equal to `max_rows_per_file`.",
           "\n`max_rows_per_group` set to value of `max_rows_per_file`."
   ```
   
   Nice informative warning here. Typically, we use backticks (`) to refer to parameters in these kinds of messages.
   
   How about, though, we narrow the condition that triggers the change in line 163 below, so that it applies only when `max_rows_per_file` has been set (i.e. isn't missing) *and* `max_rows_per_group` has been left at its default (i.e. is "missing")?
   
   I'm hesitant to make too many assumptions about user intentions, but I think we can safely say that in those circumstances the user most likely wants to set the maximum rows per file and just hasn't paid attention to the `max_rows_per_group` parameter, so we can update the value silently with no warning.
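   
   A minimal sketch of the condition described above, not the actual change to `write_dataset()`. The helper name `resolve_row_limits` and the default values are assumptions for illustration only. What should happen when both arguments are set explicitly and conflict is left out of the sketch.
   
   ```r
   # Hypothetical stand-in for the argument handling in write_dataset().
   # The function name and the defaults below are assumptions, not arrow's API.
   resolve_row_limits <- function(max_rows_per_file = 0L,
                                  max_rows_per_group = 1048576L) {
     # missing() is only reliable before the argument is modified,
     # so capture both flags first.
     file_limit_set <- !missing(max_rows_per_file)
     group_limit_unset <- missing(max_rows_per_group)
   
     # Adjust silently only when the user set the file limit and left the
     # group limit at its default.
     if (file_limit_set && group_limit_unset &&
         max_rows_per_group > max_rows_per_file) {
       max_rows_per_group <- max_rows_per_file
     }
   
     list(
       max_rows_per_file = max_rows_per_file,
       max_rows_per_group = max_rows_per_group
     )
   }
   
   # Only the file limit is given: the group limit is lowered to match.
   resolve_row_limits(max_rows_per_file = 5L)
   
   # Both are given: the values are passed through unchanged.
   resolve_row_limits(max_rows_per_file = 5L, max_rows_per_group = 10L)
   ```
   
   Because the adjustment happens only when `max_rows_per_group` is missing, no warning is needed in this branch: there is no user-supplied value being overridden.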



##########
r/tests/testthat/test-dataset-write.R:
##########
@@ -506,6 +506,47 @@ test_that("Max partitions fails with non-integer values and less than required p
   )
 })
 
+test_that("max_rows_per_group is adjusted if at odds with max_rows_per_file", {
+  skip_if_not_available("parquet")
+  df <- tibble::tibble(
+    int = 1:10,
+    dbl = as.numeric(1:10),
+    lgl = rep(c(TRUE, FALSE, NA, TRUE, FALSE), 2),
+    chr = letters[1:10],
+  )
+  dst_dir <- make_temp_dir()
+
+  # max_rows_per_group unset => pass
+  expect_silent(
+    write_dataset(df, dst_dir, max_rows_per_file = 5)
+  )
+
+  expect_equal(
+    {
+      write_dataset(df, dst_dir, max_rows_per_file = 5)
+      list.files(dst_dir, "part-") %>%
+        length()
+    },
+    2
+  )

Review Comment:
   Great attention to detail, but I think we can remove this test, as it's a little out of scope for this ticket. It's basically testing that `max_rows_per_file` works as intended (which should be tested in the C++ layer anyway), rather than the specific thing this ticket addresses (adjusting the behaviour when the user has specified a `max_rows_per_file` at odds with `max_rows_per_group`).





[GitHub] [arrow] assignUser commented on a diff in pull request #12799: ARROW-15827: [R] Improve UX of write_dataset(..., max_rows_per_group)

Posted by GitBox <gi...@apache.org>.
assignUser commented on code in PR #12799:
URL: https://github.com/apache/arrow/pull/12799#discussion_r843728947


##########
r/tests/testthat/test-dataset-write.R:
##########
@@ -506,6 +506,47 @@ test_that("Max partitions fails with non-integer values and less than required p
   )
 })
 
+test_that("max_rows_per_group is adjusted if at odds with max_rows_per_file", {
+  skip_if_not_available("parquet")
+  df <- tibble::tibble(
+    int = 1:10,
+    dbl = as.numeric(1:10),
+    lgl = rep(c(TRUE, FALSE, NA, TRUE, FALSE), 2),
+    chr = letters[1:10],
+  )
+  dst_dir <- make_temp_dir()
+
+  # max_rows_per_group unset => pass
+  expect_silent(
+    write_dataset(df, dst_dir, max_rows_per_file = 5)
+  )
+
+  expect_equal(
+    {
+      write_dataset(df, dst_dir, max_rows_per_file = 5)
+      list.files(dst_dir, "part-") %>%
+        length()
+    },
+    2
+  )

Review Comment:
   Agreed. I was unsure while writing it, and you worded my doubts clearly, thanks!





[GitHub] [arrow] github-actions[bot] commented on pull request #12799: ARROW-15827: [R] Improve UX of write_dataset(..., max_rows_per_group)

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on PR #12799:
URL: https://github.com/apache/arrow/pull/12799#issuecomment-1088687751

   https://issues.apache.org/jira/browse/ARROW-15827




[GitHub] [arrow] ursabot commented on pull request #12799: ARROW-15827: [R] Improve UX of write_dataset(..., max_rows_per_group)

Posted by GitBox <gi...@apache.org>.
ursabot commented on PR #12799:
URL: https://github.com/apache/arrow/pull/12799#issuecomment-1091054574

   Benchmark runs are scheduled for baseline = e2287f9248e9bed5bde96e152b138b9b3367e181 and contender = 88eea9cf6b0a9bcfffa8d75b2e2e1f98a81d4c73. 88eea9cf6b0a9bcfffa8d75b2e2e1f98a81d4c73 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
   Conbench compare runs links:
   [Finished :arrow_down:0.0% :arrow_up:0.0%] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/08eb5e43202a4479892f063ae80fee45...f8870ed04ea547a29451a374276821c0/)
   [Finished :arrow_down:0.17% :arrow_up:0.0%] [test-mac-arm](https://conbench.ursa.dev/compare/runs/1ce0bb640f904aa0a62215c7ebcc0f32...f05d7f4e74474498a6843539eec52b99/)
   [Failed :arrow_down:0.0% :arrow_up:0.0%] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/1c88721c21014bf6bbb457166d81e1c8...f3b3d5c5bb0f4a0ab28d1903ec5e4c1a/)
   [Finished :arrow_down:0.09% :arrow_up:0.04%] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/7089a380bd0f4de895ed5b89d76d6984...641750d6d7d5424ca069d6f1b6447721/)
   Buildkite builds:
   [Finished] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/456| `88eea9cf` ec2-t3-xlarge-us-east-2>
   [Finished] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/441| `88eea9cf` test-mac-arm>
   [Failed] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-i9-9960x/builds/442| `88eea9cf` ursa-i9-9960x>
   [Finished] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/451| `88eea9cf` ursa-thinkcentre-m75q>
   [Finished] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/455| `e2287f92` ec2-t3-xlarge-us-east-2>
   [Finished] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/440| `e2287f92` test-mac-arm>
   [Failed] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-i9-9960x/builds/441| `e2287f92` ursa-i9-9960x>
   [Finished] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/450| `e2287f92` ursa-thinkcentre-m75q>
   Supported benchmarks:
   ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
   test-mac-arm: Supported benchmark langs: C++, Python, R
   ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
   ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
   




[GitHub] [arrow] thisisnic closed pull request #12799: ARROW-15827: [R] Improve UX of write_dataset(..., max_rows_per_group)

Posted by GitBox <gi...@apache.org>.
thisisnic closed pull request #12799: ARROW-15827: [R] Improve UX of write_dataset(..., max_rows_per_group)
URL: https://github.com/apache/arrow/pull/12799

