Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/04/09 22:16:58 UTC

[GitHub] [arrow] pachamaltese opened a new pull request #9972: ARROW-12325: [R] add max_partitions argument to write_dataset()

pachamaltese opened a new pull request #9972:
URL: https://github.com/apache/arrow/pull/9972


   





[GitHub] [arrow] pachamaltese commented on a change in pull request #9972: ARROW-12315: [R] add max_partitions argument to write_dataset()

Posted by GitBox <gi...@apache.org>.
pachamaltese commented on a change in pull request #9972:
URL: https://github.com/apache/arrow/pull/9972#discussion_r612716840



##########
File path: r/tests/testthat/test-dataset.R
##########
@@ -1778,3 +1778,60 @@ test_that("Collecting zero columns from a dataset doesn't return entire dataset"
     c(32, 0)
   )
 })
+
+# see https://issues.apache.org/jira/browse/ARROW-12315
+test_that("Max partitions fails with non-integer values and less than required partitions values", {
+  skip_if_not_available("parquet")
+  tmp <- tempfile()
+
+  # this example needs 3 partitions
+
+  # max_partitions = chr => error
+  expect_error(
+    mtcars %>%
+      group_by(cyl) %>%
+      write_dataset(tmp, format = "parquet", max_partitions = "foobar")
+  )

Review comment:
       in this other case, the error is 
   ```
    Error: Expected single integer value 
   ```
   







[GitHub] [arrow] jonkeane commented on a change in pull request #9972: ARROW-12315: [R] add max_partitions argument to write_dataset()

Posted by GitBox <gi...@apache.org>.
jonkeane commented on a change in pull request #9972:
URL: https://github.com/apache/arrow/pull/9972#discussion_r612498334



##########
File path: r/tests/testthat/test-dataset.R
##########
@@ -1778,3 +1778,60 @@ test_that("Collecting zero columns from a dataset doesn't return entire dataset"
     c(32, 0)
   )
 })
+
+# see https://issues.apache.org/jira/browse/ARROW-12315
+test_that("Max partitions fails with non-integer values and less than required partitions values", {
+  skip_if_not_available("parquet")
+  tmp <- tempfile()
+
+  # this example needs 3 partitions
+
+  # max_partitions = chr => error
+  expect_error(
+    mtcars %>%
+      group_by(cyl) %>%
+      write_dataset(tmp, format = "parquet", max_partitions = "foobar")
+  )

Review comment:
       We should assert what each of these errors contains. We don't need to match the full message, but let's make sure that they error with something useful about partitions.
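   
   For example, something along these lines would do it (a sketch only, reusing the test setup above and matching a fragment of the message rather than the full text, assuming the C++ messages quoted later in this thread are what surface in R):
   ```r
   expect_error(
     mtcars %>%
       group_by(cyl) %>%
       write_dataset(tmp, format = "parquet", max_partitions = 1),
     "partitions"  # assert the message mentions partitions, not the exact wording
   )
   ```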

##########
File path: r/tests/testthat/test-dataset.R
##########
@@ -1778,3 +1778,60 @@ test_that("Collecting zero columns from a dataset doesn't return entire dataset"
     c(32, 0)
   )
 })
+
+# see https://issues.apache.org/jira/browse/ARROW-12315
+test_that("Max partitions fails with non-integer values and less than required partitions values", {
+  skip_if_not_available("parquet")
+  tmp <- tempfile()
+
+  # this example needs 3 partitions
+
+  # max_partitions = chr => error
+  expect_error(
+    mtcars %>%
+      group_by(cyl) %>%
+      write_dataset(tmp, format = "parquet", max_partitions = "foobar")
+  )
+
+  # max_partitions < 3 => error
+  expect_error(
+    mtcars %>%
+      group_by(cyl) %>%
+      write_dataset(tmp, format = "parquet", max_partitions = -3)
+  )
+
+  # max_partitions < 3 => error
+  expect_error(
+    mtcars %>%
+      group_by(cyl) %>%
+      write_dataset(tmp, format = "parquet", max_partitions = 1)
+  )

Review comment:
       We especially want to make sure that this error is clear + actionable

##########
File path: r/R/dataset-write.R
##########
@@ -60,8 +62,13 @@ write_dataset <- function(dataset,
                           format = c("parquet", "feather", "arrow", "ipc"),
                           partitioning = dplyr::group_vars(dataset),
                           basename_template = paste0("part-{i}.", as.character(format)),
-                          hive_style = TRUE,
+                          hive_style = TRUE, max_partitions = 1024L,

Review comment:
       Minor: in the .R code, we should follow the style here with each argument on a new line.

##########
File path: r/R/dataset-write.R
##########
@@ -60,8 +62,13 @@ write_dataset <- function(dataset,
                           format = c("parquet", "feather", "arrow", "ipc"),
                           partitioning = dplyr::group_vars(dataset),
                           basename_template = paste0("part-{i}.", as.character(format)),
-                          hive_style = TRUE,
+                          hive_style = TRUE, max_partitions = 1024L,
                           ...) {
+  stopifnot(
+    max_partitions == round(max_partitions, 0),
+    max_partitions == abs(max_partitions),
+    !is.null(max_partitions)
+  )

Review comment:
       Have you tried leaving this checking off to see what errors the C++ code returns? If those errors are reasonable, we should use them instead of writing our own here.







[GitHub] [arrow] jonkeane commented on pull request #9972: ARROW-12325: [R] add max_partitions argument to write_dataset()

Posted by GitBox <gi...@apache.org>.
jonkeane commented on pull request #9972:
URL: https://github.com/apache/arrow/pull/9972#issuecomment-817853983


   I don't know if this is a feature or a bug of the comment bot, but it looks like your `autotune` run failed: https://github.com/apache/arrow/runs/2310513219?check_suite_focus=true
   
   I know there was some turbulence with crossbow moving into archery recently (I don't think this is related, though it's possible). You could try that command again, or just add the newline that the lint job output appears to be complaining about.





[GitHub] [arrow] github-actions[bot] commented on pull request #9972: ARROW-12315: [R] add max_partitions argument to write_dataset()

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #9972:
URL: https://github.com/apache/arrow/pull/9972#issuecomment-818065644


   https://issues.apache.org/jira/browse/ARROW-12315





[GitHub] [arrow] pachamaltese commented on pull request #9972: ARROW-12325: [R] add max_partitions argument to write_dataset()

Posted by GitBox <gi...@apache.org>.
pachamaltese commented on pull request #9972:
URL: https://github.com/apache/arrow/pull/9972#issuecomment-817045499


   @github-actions crossbow





[GitHub] [arrow] pachamaltese commented on a change in pull request #9972: ARROW-12315: [R] add max_partitions argument to write_dataset()

Posted by GitBox <gi...@apache.org>.
pachamaltese commented on a change in pull request #9972:
URL: https://github.com/apache/arrow/pull/9972#discussion_r612693329



##########
File path: r/tests/testthat/test-dataset.R
##########
@@ -1778,3 +1778,60 @@ test_that("Collecting zero columns from a dataset doesn't return entire dataset"
     c(32, 0)
   )
 })
+
+# see https://issues.apache.org/jira/browse/ARROW-12315
+test_that("Max partitions fails with non-integer values and less than required partitions values", {
+  skip_if_not_available("parquet")
+  tmp <- tempfile()
+
+  # this example needs 3 partitions
+
+  # max_partitions = chr => error
+  expect_error(
+    mtcars %>%
+      group_by(cyl) %>%
+      write_dataset(tmp, format = "parquet", max_partitions = "foobar")
+  )
+
+  # max_partitions < 3 => error
+  expect_error(
+    mtcars %>%
+      group_by(cyl) %>%
+      write_dataset(tmp, format = "parquet", max_partitions = -3)
+  )
+
+  # max_partitions < 3 => error
+  expect_error(
+    mtcars %>%
+      group_by(cyl) %>%
+      write_dataset(tmp, format = "parquet", max_partitions = 1)
+  )

Review comment:
       the error in this case is 
   ```
    Error: Invalid: Fragment would be written into 3 partitions. This exceeds the maximum of 1 
   ```
   
   I think this is quite clear
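   
   For reference, a user can act on it directly; a minimal sketch (given that this example needs 3 partitions, i.e. the 3 values of cyl):
   ```r
   mtcars %>%
     group_by(cyl) %>%
     write_dataset(tmp, format = "parquet", max_partitions = 3)  # 3 groups of cyl, so this succeeds
   ```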







[GitHub] [arrow] paleolimbot commented on a change in pull request #9972: ARROW-12315: [R] add max_partitions argument to write_dataset()

Posted by GitBox <gi...@apache.org>.
paleolimbot commented on a change in pull request #9972:
URL: https://github.com/apache/arrow/pull/9972#discussion_r746550848



##########
File path: r/R/dataset-write.R
##########
@@ -136,9 +139,13 @@ write_dataset <- function(dataset,
   existing_data_behavior_opts <- c("delete_matching", "overwrite", "error")
   existing_data_behavior <- match(match.arg(existing_data_behavior), existing_data_behavior_opts) - 1L
 
+  if (!is_integerish(max_partitions) || is.na(max_partitions) || max_partitions < 0) {

Review comment:
       ```suggestion
     if (!is_integerish(max_partitions, n = 1) || is.na(max_partitions) || max_partitions < 0) {
   ```
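   
   For what it's worth, the `n` argument is what enforces length here; a quick sketch of the difference (assuming rlang is attached):
   ```r
   library(rlang)

   is_integerish(1024)               # TRUE: a single whole number
   is_integerish(c(1, 2, 3))         # TRUE: length is not checked without n
   is_integerish(c(1, 2, 3), n = 1)  # FALSE: n = 1 additionally requires length 1
   is_integerish(2.5, n = 1)         # FALSE: not a whole number
   ```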







[GitHub] [arrow] jonkeane commented on a change in pull request #9972: ARROW-12315: [R] add max_partitions argument to write_dataset()

Posted by GitBox <gi...@apache.org>.
jonkeane commented on a change in pull request #9972:
URL: https://github.com/apache/arrow/pull/9972#discussion_r612729965



##########
File path: r/R/dataset-write.R
##########
@@ -60,8 +62,13 @@ write_dataset <- function(dataset,
                           format = c("parquet", "feather", "arrow", "ipc"),
                           partitioning = dplyr::group_vars(dataset),
                           basename_template = paste0("part-{i}.", as.character(format)),
-                          hive_style = TRUE,
+                          hive_style = TRUE, max_partitions = 1024L,
                           ...) {
+  stopifnot(
+    max_partitions == round(max_partitions, 0),
+    max_partitions == abs(max_partitions),
+    !is.null(max_partitions)
+  )

Review comment:
       Can you tell if that's a feature or a bug in C++? This partitions parameter is pretty low-level, so I think we should stick with what C++ allows (and fix this if it's a bug in C++, though I'm not certain it is).







[GitHub] [arrow] pachamaltese removed a comment on pull request #9972: ARROW-12325: [R] add max_partitions argument to write_dataset()

Posted by GitBox <gi...@apache.org>.
pachamaltese removed a comment on pull request #9972:
URL: https://github.com/apache/arrow/pull/9972#issuecomment-817005013


   @github-actions autotune





[GitHub] [arrow] jonkeane closed pull request #9972: ARROW-12315: [R] add max_partitions argument to write_dataset()

Posted by GitBox <gi...@apache.org>.
jonkeane closed pull request #9972:
URL: https://github.com/apache/arrow/pull/9972


   





[GitHub] [arrow] pachamaltese commented on pull request #9972: ARROW-12325: [R] add max_partitions argument to write_dataset()

Posted by GitBox <gi...@apache.org>.
pachamaltese commented on pull request #9972:
URL: https://github.com/apache/arrow/pull/9972#issuecomment-817044889


   github-actions autotune





[GitHub] [arrow] lidavidm commented on a change in pull request #9972: ARROW-12315: [R] add max_partitions argument to write_dataset()

Posted by GitBox <gi...@apache.org>.
lidavidm commented on a change in pull request #9972:
URL: https://github.com/apache/arrow/pull/9972#discussion_r637918633



##########
File path: cpp/src/arrow/dataset/file_base.cc
##########
@@ -450,6 +450,12 @@ Status WriteNextBatch(WriteState& state, const std::shared_ptr<Fragment>& fragme
   ARROW_ASSIGN_OR_RAISE(auto groups, state.write_options.partitioning->Partition(batch));
   batch.reset();  // drop to hopefully conserve memory
 
+  if (static_cast<int>(state.write_options.max_partitions) < static_cast<int>(0)) {

Review comment:
       There's no need to cast here since max_partitions is already an int. (The other place had to cast since one side was an int and the other side was size_t.)

##########
File path: cpp/src/arrow/dataset/file_base.cc
##########
@@ -450,6 +450,12 @@ Status WriteNextBatch(WriteState& state, const std::shared_ptr<Fragment>& fragme
   ARROW_ASSIGN_OR_RAISE(auto groups, state.write_options.partitioning->Partition(batch));
   batch.reset();  // drop to hopefully conserve memory
 
+  if (static_cast<int>(state.write_options.max_partitions) < static_cast<int>(0)) {
+    return Status::Invalid("A number of ", state.write_options.max_partitions,
+                           " partitions is not feasible.",
+                           " Try with max_partitions = 200, 100 or any other positive integer.");

Review comment:
       ```suggestion
       return Status::Invalid("max_partitions must be positive (was ",
                              state.write_options.max_partitions, ")");
   ```

##########
File path: cpp/src/arrow/dataset/file_base.cc
##########
@@ -450,6 +450,12 @@ Status WriteNextBatch(WriteState& state, const std::shared_ptr<Fragment>& fragme
   ARROW_ASSIGN_OR_RAISE(auto groups, state.write_options.partitioning->Partition(batch));
   batch.reset();  // drop to hopefully conserve memory
 
+  if (static_cast<int>(state.write_options.max_partitions) < static_cast<int>(0)) {

Review comment:
       ```suggestion
     if (state.write_options.max_partitions <= 0) {
   ```

##########
File path: cpp/src/arrow/dataset/file_base.cc
##########
@@ -450,6 +450,12 @@ Status WriteNextBatch(WriteState& state, const std::shared_ptr<Fragment>& fragme
   ARROW_ASSIGN_OR_RAISE(auto groups, state.write_options.partitioning->Partition(batch));
   batch.reset();  // drop to hopefully conserve memory
 
+  if (static_cast<int>(state.write_options.max_partitions) < static_cast<int>(0)) {

Review comment:
       Also, I'd probably put this validation in WriteInternal() below so it gets checked immediately.

##########
File path: r/R/dataset-write.R
##########
@@ -40,6 +40,8 @@
 #' will yield `"part-0.feather", ...`.
 #' @param hive_style logical: write partition segments as Hive-style
 #' (`key1=value1/key2=value2/file.ext`) or as just bare values. Default is `TRUE`.
+#' @param max_partitions maximum number of partitions any batch may be
+#' written into. Default is default 1024L (integer).

Review comment:
       ```suggestion
   #' written into. Default is 1024L (integer).
   ```







[GitHub] [arrow] ursabot commented on pull request #9972: ARROW-12315: [R] add max_partitions argument to write_dataset()

Posted by GitBox <gi...@apache.org>.
ursabot commented on pull request #9972:
URL: https://github.com/apache/arrow/pull/9972#issuecomment-965213525


   Benchmark runs are scheduled for baseline = f51dc3475273d2cb5f571b534283faf892a4d701 and contender = 3c1f702e0542290c089b94bdf24e8315e1f95655. 3c1f702e0542290c089b94bdf24e8315e1f95655 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
   Conbench compare runs links:
   [Scheduled] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/b355e096096145128976c30dc0f2eca6...fbf5ce4aff3c4efabeff619fd30f449d/)
   [Scheduled] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/1cc4faada3244badbb70284d7daf60cb...6c77f74992144b15b77746189950c26b/)
   [Scheduled] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/dc3ac8fe359041f4bf665c45aa9a9b57...553c3156a5b245378e3c5ea40887e9de/)
   Supported benchmarks:
   ursa-i9-9960x: langs = Python, R, JavaScript
   ursa-thinkcentre-m75q: langs = C++, Java
   ec2-t3-xlarge-us-east-2: cloud = True
   





[GitHub] [arrow] ursabot edited a comment on pull request #9972: ARROW-12315: [R] add max_partitions argument to write_dataset()

Posted by GitBox <gi...@apache.org>.
ursabot edited a comment on pull request #9972:
URL: https://github.com/apache/arrow/pull/9972#issuecomment-965213525


   Benchmark runs are scheduled for baseline = f51dc3475273d2cb5f571b534283faf892a4d701 and contender = 3c1f702e0542290c089b94bdf24e8315e1f95655. 3c1f702e0542290c089b94bdf24e8315e1f95655 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
   Conbench compare runs links:
   [Finished :arrow_down:0.0% :arrow_up:0.0%] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/b355e096096145128976c30dc0f2eca6...fbf5ce4aff3c4efabeff619fd30f449d/)
   [Failed :arrow_down:0.0% :arrow_up:0.0%] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/1cc4faada3244badbb70284d7daf60cb...6c77f74992144b15b77746189950c26b/)
   [Scheduled] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/dc3ac8fe359041f4bf665c45aa9a9b57...553c3156a5b245378e3c5ea40887e9de/)
   Supported benchmarks:
   ursa-i9-9960x: langs = Python, R, JavaScript
   ursa-thinkcentre-m75q: langs = C++, Java
   ec2-t3-xlarge-us-east-2: cloud = True
   





[GitHub] [arrow] pachamaltese commented on a change in pull request #9972: ARROW-12315: [R] add max_partitions argument to write_dataset()

Posted by GitBox <gi...@apache.org>.
pachamaltese commented on a change in pull request #9972:
URL: https://github.com/apache/arrow/pull/9972#discussion_r612692056



##########
File path: r/R/dataset-write.R
##########
@@ -60,8 +62,13 @@ write_dataset <- function(dataset,
                           format = c("parquet", "feather", "arrow", "ipc"),
                           partitioning = dplyr::group_vars(dataset),
                           basename_template = paste0("part-{i}.", as.character(format)),
-                          hive_style = TRUE,
+                          hive_style = TRUE, max_partitions = 1024L,
                           ...) {
+  stopifnot(
+    max_partitions == round(max_partitions, 0),
+    max_partitions == abs(max_partitions),
+    !is.null(max_partitions)
+  )

Review comment:
       Yes, I wrote that because I can pass max_partitions = -3
   and it will create 3 partitions, which is very confusing.
   Also, NA works and falls back to the default 1024L, but NULL fails.
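   
   A minimal repro of the behaviour described above (sketch only, against the pre-fix build, assuming arrow and dplyr are attached; mtcars grouped by cyl needs 3 partitions):
   ```r
   library(arrow)
   library(dplyr)

   tmp <- tempfile()

   mtcars %>%
     group_by(cyl) %>%
     write_dataset(tmp, format = "parquet", max_partitions = -3)   # silently writes 3 partitions

   mtcars %>%
     group_by(cyl) %>%
     write_dataset(tmp, format = "parquet", max_partitions = NA)   # falls back to the default 1024L

   mtcars %>%
     group_by(cyl) %>%
     write_dataset(tmp, format = "parquet", max_partitions = NULL) # errors
   ```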







[GitHub] [arrow] pachamaltese commented on pull request #9972: ARROW-12325: [R] add max_partitions argument to write_dataset()

Posted by GitBox <gi...@apache.org>.
pachamaltese commented on pull request #9972:
URL: https://github.com/apache/arrow/pull/9972#issuecomment-817045792


   @github-actions autotune





[GitHub] [arrow] github-actions[bot] commented on pull request #9972: ARROW-12325: [R] add max_partitions argument to write_dataset()

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #9972:
URL: https://github.com/apache/arrow/pull/9972#issuecomment-817045602


   ```
   Usage: @github-actions crossbow [OPTIONS] COMMAND [ARGS]...
   
     Trigger crossbow builds for this pull request
   
   Options:
     -c, --crossbow TEXT  Crossbow repository on github to use
     --help               Show this message and exit.
   
   Commands:
     submit  Submit crossbow testing tasks.
   ```





[GitHub] [arrow] jonkeane commented on a change in pull request #9972: ARROW-12315: [R] add max_partitions argument to write_dataset()

Posted by GitBox <gi...@apache.org>.
jonkeane commented on a change in pull request #9972:
URL: https://github.com/apache/arrow/pull/9972#discussion_r612730467



##########
File path: r/tests/testthat/test-dataset.R
##########
@@ -1778,3 +1778,60 @@ test_that("Collecting zero columns from a dataset doesn't return entire dataset"
     c(32, 0)
   )
 })
+
+# see https://issues.apache.org/jira/browse/ARROW-12315
+test_that("Max partitions fails with non-integer values and less than required partitions values", {
+  skip_if_not_available("parquet")
+  tmp <- tempfile()
+
+  # this example needs 3 partitions
+
+  # max_partitions = chr => error
+  expect_error(
+    mtcars %>%
+      group_by(cyl) %>%
+      write_dataset(tmp, format = "parquet", max_partitions = "foobar")
+  )

Review comment:
       And like I mentioned below: if we are using the error handling from C++, and it is already tested in C++, we don't need to test it in R as well, as long as there's no other indirection going on.







[GitHub] [arrow] pachamaltese commented on pull request #9972: ARROW-12325: [R] add max_partitions argument to write_dataset()

Posted by GitBox <gi...@apache.org>.
pachamaltese commented on pull request #9972:
URL: https://github.com/apache/arrow/pull/9972#issuecomment-817044785


   @github-actions autotune





[GitHub] [arrow] pachamaltese commented on pull request #9972: ARROW-12325: [R] add max_partitions argument to write_dataset()

Posted by GitBox <gi...@apache.org>.
pachamaltese commented on pull request #9972:
URL: https://github.com/apache/arrow/pull/9972#issuecomment-817005013


   @github-actions autotune





[GitHub] [arrow] jonkeane commented on a change in pull request #9972: ARROW-12315: [R] add max_partitions argument to write_dataset()

Posted by GitBox <gi...@apache.org>.
jonkeane commented on a change in pull request #9972:
URL: https://github.com/apache/arrow/pull/9972#discussion_r612729383



##########
File path: r/tests/testthat/test-dataset.R
##########
@@ -1778,3 +1778,60 @@ test_that("Collecting zero columns from a dataset doesn't return entire dataset"
     c(32, 0)
   )
 })
+
+# see https://issues.apache.org/jira/browse/ARROW-12315
+test_that("Max partitions fails with non-integer values and less than required partitions values", {
+  skip_if_not_available("parquet")
+  tmp <- tempfile()
+
+  # this example needs 3 partitions
+
+  # max_partitions = chr => error
+  expect_error(
+    mtcars %>%
+      group_by(cyl) %>%
+      write_dataset(tmp, format = "parquet", max_partitions = "foobar")
+  )
+
+  # max_partitions < 3 => error
+  expect_error(
+    mtcars %>%
+      group_by(cyl) %>%
+      write_dataset(tmp, format = "parquet", max_partitions = -3)
+  )
+
+  # max_partitions < 3 => error
+  expect_error(
+    mtcars %>%
+      group_by(cyl) %>%
+      write_dataset(tmp, format = "parquet", max_partitions = 1)
+  )

Review comment:
       Yeah, this looks fine to me; let's use the C++ error. If the error is coming directly from C++ (and it's tested there), we don't actually need to test it here as well.







[GitHub] [arrow] pachamaltese commented on pull request #9972: ARROW-12315: [R] add max_partitions argument to write_dataset()

Posted by GitBox <gi...@apache.org>.
pachamaltese commented on pull request #9972:
URL: https://github.com/apache/arrow/pull/9972#issuecomment-817873691


   @github-actions autotune





[GitHub] [arrow] ursabot edited a comment on pull request #9972: ARROW-12315: [R] add max_partitions argument to write_dataset()

Posted by GitBox <gi...@apache.org>.
ursabot edited a comment on pull request #9972:
URL: https://github.com/apache/arrow/pull/9972#issuecomment-965213525


   Benchmark runs are scheduled for baseline = f51dc3475273d2cb5f571b534283faf892a4d701 and contender = 3c1f702e0542290c089b94bdf24e8315e1f95655. 3c1f702e0542290c089b94bdf24e8315e1f95655 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
   Conbench compare runs links:
   [Finished :arrow_down:0.0% :arrow_up:0.0%] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/b355e096096145128976c30dc0f2eca6...fbf5ce4aff3c4efabeff619fd30f449d/)
   [Scheduled] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/1cc4faada3244badbb70284d7daf60cb...6c77f74992144b15b77746189950c26b/)
   [Scheduled] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/dc3ac8fe359041f4bf665c45aa9a9b57...553c3156a5b245378e3c5ea40887e9de/)
   Supported benchmarks:
   ursa-i9-9960x: langs = Python, R, JavaScript
   ursa-thinkcentre-m75q: langs = C++, Java
   ec2-t3-xlarge-us-east-2: cloud = True
   





[GitHub] [arrow] pachamaltese removed a comment on pull request #9972: ARROW-12325: [R] add max_partitions argument to write_dataset()

Posted by GitBox <gi...@apache.org>.
pachamaltese removed a comment on pull request #9972:
URL: https://github.com/apache/arrow/pull/9972#issuecomment-817044889









[GitHub] [arrow] jonkeane commented on a change in pull request #9972: ARROW-12315: [R] add max_partitions argument to write_dataset()

Posted by GitBox <gi...@apache.org>.
jonkeane commented on a change in pull request #9972:
URL: https://github.com/apache/arrow/pull/9972#discussion_r746596046



##########
File path: r/R/dataset-write.R
##########
@@ -136,9 +139,13 @@ write_dataset <- function(dataset,
   existing_data_behavior_opts <- c("delete_matching", "overwrite", "error")
   existing_data_behavior <- match(match.arg(existing_data_behavior), existing_data_behavior_opts) - 1L
 
+  if (!is_integerish(max_partitions) || is.na(max_partitions) || max_partitions < 0) {

Review comment:
       TIL about the `n` argument, thanks!







[GitHub] [arrow] lidavidm commented on a change in pull request #9972: ARROW-12315: [R] add max_partitions argument to write_dataset()

Posted by GitBox <gi...@apache.org>.
lidavidm commented on a change in pull request #9972:
URL: https://github.com/apache/arrow/pull/9972#discussion_r660563003



##########
File path: cpp/src/arrow/dataset/file_base.cc
##########
@@ -451,7 +451,12 @@ Status WriteNextBatch(WriteState* state, const std::shared_ptr<Fragment>& fragme
   ARROW_ASSIGN_OR_RAISE(auto groups, state->write_options.partitioning->Partition(batch));
   batch.reset();  // drop to hopefully conserve memory
 
-  if (groups.batches.size() > static_cast<size_t>(state->write_options.max_partitions)) {
+  if (state.write_options.max_partitions <= 0) {
+    return Status::Invalid("max_partitions must be positive (was ",
+                           state.write_options.max_partitions, ")");
+  }
+
+  if (groups.batches.size() > static_cast<size_t>(state.write_options.max_partitions)) {

Review comment:
       ```suggestion
     if (state->write_options.max_partitions <= 0) {
       return Status::Invalid("max_partitions must be positive (was ",
                              state->write_options.max_partitions, ")");
     }
   
     if (groups.batches.size() > static_cast<size_t>(state->write_options.max_partitions)) {
   ```

##########
File path: cpp/src/arrow/dataset/file_base.cc
##########
@@ -451,7 +451,12 @@ Status WriteNextBatch(WriteState* state, const std::shared_ptr<Fragment>& fragme
   ARROW_ASSIGN_OR_RAISE(auto groups, state->write_options.partitioning->Partition(batch));
   batch.reset();  // drop to hopefully conserve memory
 
-  if (groups.batches.size() > static_cast<size_t>(state->write_options.max_partitions)) {
+  if (state.write_options.max_partitions <= 0) {
+    return Status::Invalid("max_partitions must be positive (was ",
+                           state.write_options.max_partitions, ")");
+  }
+
+  if (groups.batches.size() > static_cast<size_t>(state.write_options.max_partitions)) {

Review comment:
       Just to fix the build errors.







[GitHub] [arrow] pachadotdev commented on pull request #9972: ARROW-12315: [R] add max_partitions argument to write_dataset()

Posted by GitBox <gi...@apache.org>.
pachadotdev commented on pull request #9972:
URL: https://github.com/apache/arrow/pull/9972#issuecomment-986350945


   Thanks a lot @jonkeane and @paleolimbot for polishing the remaining bits of this one!!
   Glad to have made something useful on my side and helped to add a useful feature (and bug detection) on the R/C++ side :D





[GitHub] [arrow] pachamaltese commented on a change in pull request #9972: ARROW-12315: [R] add max_partitions argument to write_dataset()

Posted by GitBox <gi...@apache.org>.
pachamaltese commented on a change in pull request #9972:
URL: https://github.com/apache/arrow/pull/9972#discussion_r612806301



##########
File path: r/R/dataset-write.R
##########
@@ -60,8 +62,13 @@ write_dataset <- function(dataset,
                           format = c("parquet", "feather", "arrow", "ipc"),
                           partitioning = dplyr::group_vars(dataset),
                           basename_template = paste0("part-{i}.", as.character(format)),
-                          hive_style = TRUE,
+                          hive_style = TRUE, max_partitions = 1024L,
                           ...) {
+  stopifnot(
+    max_partitions == round(max_partitions, 0),
+    max_partitions == abs(max_partitions),
+    !is.null(max_partitions)
+  )

Review comment:
       related https://issues.apache.org/jira/browse/ARROW-12373







[GitHub] [arrow] github-actions[bot] commented on pull request #9972: ARROW-12325: [R] add max_partitions argument to write_dataset()

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #9972:
URL: https://github.com/apache/arrow/pull/9972#issuecomment-817001196


   https://issues.apache.org/jira/browse/ARROW-12325





[GitHub] [arrow] ursabot edited a comment on pull request #9972: ARROW-12315: [R] add max_partitions argument to write_dataset()

Posted by GitBox <gi...@apache.org>.
ursabot edited a comment on pull request #9972:
URL: https://github.com/apache/arrow/pull/9972#issuecomment-965213525


   Benchmark runs are scheduled for baseline = f51dc3475273d2cb5f571b534283faf892a4d701 and contender = 3c1f702e0542290c089b94bdf24e8315e1f95655. 3c1f702e0542290c089b94bdf24e8315e1f95655 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
   Conbench compare runs links:
   [Finished :arrow_down:0.0% :arrow_up:0.0%] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/b355e096096145128976c30dc0f2eca6...fbf5ce4aff3c4efabeff619fd30f449d/)
   [Failed :arrow_down:0.0% :arrow_up:0.0%] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/1cc4faada3244badbb70284d7daf60cb...6c77f74992144b15b77746189950c26b/)
   [Finished :arrow_down:0.18% :arrow_up:0.04%] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/dc3ac8fe359041f4bf665c45aa9a9b57...553c3156a5b245378e3c5ea40887e9de/)
   Supported benchmarks:
   ursa-i9-9960x: langs = Python, R, JavaScript
   ursa-thinkcentre-m75q: langs = C++, Java
   ec2-t3-xlarge-us-east-2: cloud = True
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org