Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/07/04 14:17:13 UTC

[GitHub] [arrow] jvanstraten opened a new pull request, #13509: ARROW-16700: [C++][R][Datasets] aggregates on partitioning columns

jvanstraten opened a new pull request, #13509:
URL: https://github.com/apache/arrow/pull/13509

   The min/max aggregate compute kernels seemed to discard their state between partitions, so they would only aggregate the last partition they saw (in each thread).
   
   This is the simplest change I could come up with to fix this, but honestly I'm not sure why the `local` variable even exists. It seems to me it could just be replaced with `this->state` directly, since there doesn't seem to be any failure path where `this->state` isn't updated from `local`. Am I missing something?
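   
   For illustration only, and not Arrow's actual `MinMaxImpl` (all names below are made up), here is a self-contained sketch of the consume/merge contract the fix relies on: `Consume()` has to fold each batch into the state accumulated so far, because the caller may invoke it several times on the same instance before `MergeFrom()`/`Finalize()`.
   
   ```cpp
   // Standalone sketch of the consume/merge pattern; all names are hypothetical.
   #include <algorithm>
   #include <cassert>
   #include <limits>
   #include <vector>
   
   struct MinMaxState {
     int min = std::numeric_limits<int>::max();
     int max = std::numeric_limits<int>::min();
     void MergeOne(int v) {
       min = std::min(min, v);
       max = std::max(max, v);
     }
     void MergeFrom(const MinMaxState& other) {
       min = std::min(min, other.min);
       max = std::max(max, other.max);
     }
   };
   
   struct MinMaxAggregator {
     MinMaxState state;
   
     void Consume(const std::vector<int>& batch) {
       // Start from the state accumulated so far; an implementation that starts
       // from a default-constructed MinMaxState here loses the earlier batches.
       MinMaxState local = state;
       for (int v : batch) local.MergeOne(v);
       state = local;
     }
   
     void MergeFrom(const MinMaxAggregator& other) { state.MergeFrom(other.state); }
   };
   
   int main() {
     MinMaxAggregator agg;
     agg.Consume({9, 1, 3, 4});
     agg.Consume({5, 1, 2, 3});   // second Consume() on the same instance
     assert(agg.state.min == 1);
     assert(agg.state.max == 9);  // an overwriting Consume() would leave 5 here
     return 0;
   }
   ```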


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] lidavidm commented on a diff in pull request #13509: ARROW-16904: [C++] min/max not deterministic if Parquet files have multiple row groups

Posted by GitBox <gi...@apache.org>.
lidavidm commented on code in PR #13509:
URL: https://github.com/apache/arrow/pull/13509#discussion_r917049492


##########
cpp/src/arrow/compute/exec/plan_test.cc:
##########
@@ -933,6 +933,38 @@ TEST(ExecPlanExecution, SourceGroupedSum) {
   }
 }
 
+TEST(ExecPlanExecution, SourceMinMaxScalar) {
+  for (bool parallel : { false, true }) {
+    SCOPED_TRACE(parallel ? "parallel/merged" : "serial");
+
+    auto input = MakeGroupableBatches(/*multiplicity=*/parallel ? 100 : 1);
+    auto minmax_opts = std::make_shared<ScalarAggregateOptions>();
+    auto expected_value = StructScalar::Make(

Review Comment:
   https://github.com/apache/arrow/blob/8042f001fbadd2d4bb395e42a94c76fe7ee01d3b/cpp/src/arrow/ipc/json_simple.cc#L659-L661
   
   should be `{"min": -8, "max": 12}`





[GitHub] [arrow] lidavidm commented on pull request #13509: ARROW-16904: [C++] min/max not deterministic if Parquet files have multiple row groups

Posted by GitBox <gi...@apache.org>.
lidavidm commented on PR #13509:
URL: https://github.com/apache/arrow/pull/13509#issuecomment-1179000158

   Something is very off with the merge here. Can we do something like cherry-pick the commits onto a fresh branch and then force-push?




[GitHub] [arrow] lidavidm commented on pull request #13509: ARROW-16904: [C++] min/max not deterministic if Parquet files have multiple row groups

Posted by GitBox <gi...@apache.org>.
lidavidm commented on PR #13509:
URL: https://github.com/apache/arrow/pull/13509#issuecomment-1179187482

   Note the R lints https://github.com/apache/arrow/runs/7253994871?check_suite_focus=true 
   
   ```
   
   > lintr::lint_package('/arrow/r')
   Warning: file=tests/testthat/test-dataset.R,line=622,col=30,[infix_spaces_linter] Put spaces around all infix operators.
   INFO:archery:Running Docker linter
   Warning: file=tests/testthat/test-dataset.R,line=624,col=27,[function_left_parentheses_linter] Remove spaces before the left parenthesis in a function call.
   Warning: file=tests/testthat/test-dataset.R,line=629,col=8,[pipe_continuation_linter] `%>%` should always have a space before it and a new line after it, unless the full pipeline fits on one line.
   Warning: file=tests/testthat/test-dataset.R,line=632,col=26,[single_quotes_linter] Only use double-quotes.
   Warning: file=tests/testthat/test-dataset.R,line=632,col=51,[infix_spaces_linter] Put spaces around all infix operators.
   Warning: file=tests/testthat/test-dataset.R,line=632,col=52,[single_quotes_linter] Only use double-quotes.
   Warning: file=tests/testthat/test-dataset.R,line=635,col=27,[function_left_parentheses_linter] Remove spaces before the left parenthesis in a function call.
   Warning: file=tests/testthat/test-dataset.R,line=640,col=8,[pipe_continuation_linter] `%>%` should always have a space before it and a new line after it, unless the full pipeline fits on one line.
   > 
   >
   ```




[GitHub] [arrow] jvanstraten commented on pull request #13509: ARROW-16904: [C++] min/max not deterministic if Parquet files have multiple row groups

Posted by GitBox <gi...@apache.org>.
jvanstraten commented on PR #13509:
URL: https://github.com/apache/arrow/pull/13509#issuecomment-1175360041

   I've relinked this to ARROW-16904, since it's more indicative of what's being fixed, and so that ARROW-16700 can be used for the issue related to guarantee expressions. Feel free to overrule this PR with @drin's version once they've worked out test coverage.




[GitHub] [arrow] jvanstraten commented on pull request #13509: ARROW-16904: [C++] min/max not deterministic if Parquet files have multiple row groups

Posted by GitBox <gi...@apache.org>.
jvanstraten commented on PR #13509:
URL: https://github.com/apache/arrow/pull/13509#issuecomment-1179105055

   Should be fixed. I guess a merge commit was lost somewhere along the way and git(hub) got confused.




[GitHub] [arrow] lidavidm commented on pull request #13509: ARROW-16700: [C++][R][Datasets] aggregates on partitioning columns

Posted by GitBox <gi...@apache.org>.
lidavidm commented on PR #13509:
URL: https://github.com/apache/arrow/pull/13509#issuecomment-1175195111

   The behavior you're seeing stems from this: https://github.com/apache/arrow/blob/3d6240c1ee7802829d2ed209f4135906e9413915/cpp/src/arrow/compute/exec.cc#L1124-L1177
   
   We could/should fix that up too (perhaps in a separate JIRA)




[GitHub] [arrow] drin commented on pull request #13509: ARROW-16904: [C++] min/max not deterministic if Parquet files have multiple row groups

Posted by GitBox <gi...@apache.org>.
drin commented on PR #13509:
URL: https://github.com/apache/arrow/pull/13509#issuecomment-1175525582

   Sorry, I just caught up on the various threads leading here.
   
   I can just add onto this PR, if you don't mind, @jvanstraten.




[GitHub] [arrow] jvanstraten commented on pull request #13509: ARROW-16904: [C++] min/max not deterministic if Parquet files have multiple row groups

Posted by GitBox <gi...@apache.org>.
jvanstraten commented on PR #13509:
URL: https://github.com/apache/arrow/pull/13509#issuecomment-1175564509

   Also fine with me. I assume you mean to PR into my branch so it ends up in here?




[GitHub] [arrow] wjones127 commented on a diff in pull request #13509: ARROW-16904: [C++] min/max not deterministic if Parquet files have multiple row groups

Posted by GitBox <gi...@apache.org>.
wjones127 commented on code in PR #13509:
URL: https://github.com/apache/arrow/pull/13509#discussion_r917097365


##########
r/tests/testthat/test-dataset.R:
##########
@@ -618,6 +618,33 @@ test_that("UnionDataset handles InMemoryDatasets", {
   expect_equal(actual, expected)
 })
 
+test_that("scalar aggregates with many batches", {
+  test_data <- data.frame(val = 1:1e7)
+  expected_result_distr <- (
+    sapply(1:100, function(iter_ndx) {
+      test_data                              %>%
+        dplyr::summarise(min_val = min(val)) %>%
+        dplyr::collect()                     %>%
+        dplyr::pull(min_val)
+    }) %>%
+      table()
+  )
+
+  ds_tmpfile <- tempfile("test-aggregate", fileext = ".parquet")
+  arrow::write_parquet(test_data, ds_tmpfile)
+  actual_result_distr <- (
+    sapply(1:100, function(iter_ndx) {
+      arrow::open_dataset(ds_tmpfile)        %>%
+        dplyr::summarise(min_val = min(val)) %>%
+        dplyr::collect()                     %>%
+        dplyr::pull(min_val)
+    }) %>%
+      table()
+  )

Review Comment:
   nit: if we aren't using iter_ndx, we can use `replicate`
   ```suggestion
     actual_result_distr <- (
       replicate(100, {
         arrow::open_dataset(ds_tmpfile)        %>%
           dplyr::summarise(min_val = min(val)) %>%
           dplyr::collect()                     %>%
           dplyr::pull(min_val)
       }) %>%
         table()
     )
   ```





[GitHub] [arrow] drin commented on a diff in pull request #13509: ARROW-16904: [C++] min/max not deterministic if Parquet files have multiple row groups

Posted by GitBox <gi...@apache.org>.
drin commented on code in PR #13509:
URL: https://github.com/apache/arrow/pull/13509#discussion_r917100970


##########
r/tests/testthat/test-dataset.R:
##########
@@ -618,6 +618,33 @@ test_that("UnionDataset handles InMemoryDatasets", {
   expect_equal(actual, expected)
 })
 
+test_that("scalar aggregates with many batches", {
+  test_data <- data.frame(val = 1:1e7)
+  expected_result_distr <- (
+    sapply(1:100, function(iter_ndx) {
+      test_data                              %>%
+        dplyr::summarise(min_val = min(val)) %>%
+        dplyr::collect()                     %>%
+        dplyr::pull(min_val)
+    }) %>%
+      table()
+  )
+
+  ds_tmpfile <- tempfile("test-aggregate", fileext = ".parquet")
+  arrow::write_parquet(test_data, ds_tmpfile)
+  actual_result_distr <- (
+    sapply(1:100, function(iter_ndx) {
+      arrow::open_dataset(ds_tmpfile)        %>%
+        dplyr::summarise(min_val = min(val)) %>%
+        dplyr::collect()                     %>%
+        dplyr::pull(min_val)
+    }) %>%
+      table()
+  )

Review Comment:
   Nice! Neal's suggestion included this change, so I took that.





[GitHub] [arrow] github-actions[bot] commented on pull request #13509: ARROW-16700: [C++][R][Datasets] aggregates on partitioning columns

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on PR #13509:
URL: https://github.com/apache/arrow/pull/13509#issuecomment-1173873922

   https://issues.apache.org/jira/browse/ARROW-16700




[GitHub] [arrow] drin commented on pull request #13509: ARROW-16904: [C++] min/max not deterministic if Parquet files have multiple row groups

Posted by GitBox <gi...@apache.org>.
drin commented on PR #13509:
URL: https://github.com/apache/arrow/pull/13509#issuecomment-1179252960

   > Note the R lints https://github.com/apache/arrow/runs/7253994871?check_suite_focus=true
   > 
   > ```
   > 
   > > lintr::lint_package('/arrow/r')
   > Warning: file=tests/testthat/test-dataset.R,line=622,col=30,[infix_spaces_linter] Put spaces around all infix operators.
   > INFO:archery:Running Docker linter
   > Warning: file=tests/testthat/test-dataset.R,line=624,col=27,[function_left_parentheses_linter] Remove spaces before the left parenthesis in a function call.
   > Warning: file=tests/testthat/test-dataset.R,line=629,col=8,[pipe_continuation_linter] `%>%` should always have a space before it and a new line after it, unless the full pipeline fits on one line.
   > Warning: file=tests/testthat/test-dataset.R,line=632,col=26,[single_quotes_linter] Only use double-quotes.
   > Warning: file=tests/testthat/test-dataset.R,line=632,col=51,[infix_spaces_linter] Put spaces around all infix operators.
   > Warning: file=tests/testthat/test-dataset.R,line=632,col=52,[single_quotes_linter] Only use double-quotes.
   > Warning: file=tests/testthat/test-dataset.R,line=635,col=27,[function_left_parentheses_linter] Remove spaces before the left parenthesis in a function call.
   > Warning: file=tests/testthat/test-dataset.R,line=640,col=8,[pipe_continuation_linter] `%>%` should always have a space before it and a new line after it, unless the full pipeline fits on one line.
   > > 
   > >
   > ```
   
   Thanks. I tried to match the existing style because I had trouble running the linter. I'll fix these and try to get it running.




[GitHub] [arrow] wjones127 commented on a diff in pull request #13509: ARROW-16904: [C++] min/max not deterministic if Parquet files have multiple row groups

Posted by GitBox <gi...@apache.org>.
wjones127 commented on code in PR #13509:
URL: https://github.com/apache/arrow/pull/13509#discussion_r917096886


##########
r/tests/testthat/test-dataset.R:
##########
@@ -618,6 +618,33 @@ test_that("UnionDataset handles InMemoryDatasets", {
   expect_equal(actual, expected)
 })
 
+test_that("scalar aggregates with many batches", {
+  test_data <- data.frame(val = 1:1e7)
+  expected_result_distr <- (
+    sapply(1:100, function(iter_ndx) {
+      test_data                              %>%
+        dplyr::summarise(min_val = min(val)) %>%
+        dplyr::collect()                     %>%
+        dplyr::pull(min_val)
+    }) %>%
+      table()
+  )
+
+  ds_tmpfile <- tempfile("test-aggregate", fileext = ".parquet")
+  arrow::write_parquet(test_data, ds_tmpfile)
+  actual_result_distr <- (
+    sapply(1:100, function(iter_ndx) {
+      arrow::open_dataset(ds_tmpfile)        %>%
+        dplyr::summarise(min_val = min(val)) %>%
+        dplyr::collect()                     %>%
+        dplyr::pull(min_val)
+    }) %>%
+      table()
+  )
+
+  expect_equal(actual_result_distr, expected_result_distr)

Review Comment:
   I'm confused by this test. Shouldn't the `min(val)` always resolve to 1? Why can't we test that?





[GitHub] [arrow] lidavidm commented on a diff in pull request #13509: ARROW-16904: [C++] min/max not deterministic if Parquet files have multiple row groups

Posted by GitBox <gi...@apache.org>.
lidavidm commented on code in PR #13509:
URL: https://github.com/apache/arrow/pull/13509#discussion_r916983041


##########
cpp/src/arrow/compute/exec/plan_test.cc:
##########
@@ -933,6 +933,38 @@ TEST(ExecPlanExecution, SourceGroupedSum) {
   }
 }
 
+TEST(ExecPlanExecution, SourceMinMaxScalar) {
+  for (bool parallel : { false, true }) {
+    SCOPED_TRACE(parallel ? "parallel/merged" : "serial");
+
+    auto input = MakeGroupableBatches(/*multiplicity=*/parallel ? 100 : 1);
+    auto minmax_opts = std::make_shared<ScalarAggregateOptions>();
+    auto expected_value = StructScalar::Make(

Review Comment:
   nit, but doesn't ScalarFromJSON handle StructScalar directly?



##########
cpp/src/arrow/compute/exec/options.h:
##########
@@ -109,6 +109,10 @@ class ARROW_EXPORT ProjectNodeOptions : public ExecNodeOptions {
 };
 
 /// \brief Make a node which aggregates input batches, optionally grouped by keys.
+///
+/// If the keys attribute is a non-empty vector, then each aggregate in `aggregates` is
+/// expected to be a HashAggregate function. If the keys attribute is an empty vector,
+/// then each aggregate is assumed to be a ScalarAggregate function.

Review Comment:
   :heavy_check_mark: thank you!



##########
cpp/src/arrow/compute/exec/plan_test.cc:
##########
@@ -933,6 +933,38 @@ TEST(ExecPlanExecution, SourceGroupedSum) {
   }
 }
 
+TEST(ExecPlanExecution, SourceMinMaxScalar) {

Review Comment:
   ```suggestion
   TEST(ExecPlanExecution, SourceMinMaxScalar) {
     // Regression test for ARROW-16904
   ```



##########
cpp/src/arrow/compute/exec/plan_test.cc:
##########
@@ -933,6 +933,38 @@ TEST(ExecPlanExecution, SourceGroupedSum) {
   }
 }
 
+TEST(ExecPlanExecution, SourceMinMaxScalar) {

Review Comment:
   Is there a task filed for testing all aggregates in this fashion?



##########
cpp/src/arrow/compute/exec/plan_test.cc:
##########
@@ -933,6 +933,38 @@ TEST(ExecPlanExecution, SourceGroupedSum) {
   }
 }
 
+TEST(ExecPlanExecution, SourceMinMaxScalar) {

Review Comment:
   Preferably we'd update the existing aggregate tests to run all aggregates "both ways" instead of manually duplicating tests like this on a case-by-case basis.





[GitHub] [arrow] github-actions[bot] commented on pull request #13509: ARROW-16904: [C++] min/max not deterministic if Parquet files have multiple row groups

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on PR #13509:
URL: https://github.com/apache/arrow/pull/13509#issuecomment-1175358050

   https://issues.apache.org/jira/browse/ARROW-16904




[GitHub] [arrow] drin commented on pull request #13509: ARROW-16904: [C++] min/max not deterministic if Parquet files have multiple row groups

Posted by GitBox <gi...@apache.org>.
drin commented on PR #13509:
URL: https://github.com/apache/arrow/pull/13509#issuecomment-1175588264

   yep!




[GitHub] [arrow] jvanstraten commented on pull request #13509: ARROW-16700: [C++][R][Datasets] aggregates on partitioning columns

Posted by GitBox <gi...@apache.org>.
jvanstraten commented on PR #13509:
URL: https://github.com/apache/arrow/pull/13509#issuecomment-1175169931

   Chunked arrays are tested [here](https://github.com/apache/arrow/blob/3d6240c1ee7802829d2ed209f4135906e9413915/cpp/src/arrow/compute/kernels/aggregate_test.cc#L1566-L1568), but, based on my cleaned-up debug prints for `chunked_input1`, the call pattern there is
   
   ```
   auto min_max_impl_1 = MinMaxImpl(...)
   auto min_max_impl_2 = MinMaxImpl(...)
   min_max_impl_2.Consume([5,1,2,3,4]) -> (1, 5)
   min_max_impl_1.MergeFrom(min_max_impl_2) -> (1, 5)
   auto min_max_impl_3 = MinMaxImpl(...)
   min_max_impl_3.Consume([9,1,null,3,4]) -> (1, 9)
   min_max_impl_1.MergeFrom(min_max_impl_3) -> (1, 9)
   min_max_impl_1.Finalize() -> (1, 9)
   ```
   
   for which it does not matter whether `Consume()` overrides the previous state. The test cases aren't great anyway, since the last chunk holds the min and max values for each of them, but even if I swapped one of them around you'd still get
   
   ```
   auto min_max_impl_1 = MinMaxImpl(...)
   auto min_max_impl_2 = MinMaxImpl(...)
   min_max_impl_2.Consume([9,1,null,3,4]) -> (1, 9)
   min_max_impl_1.MergeFrom(min_max_impl_2) -> (1, 9)
   auto min_max_impl_3 = MinMaxImpl(...)
   min_max_impl_3.Consume([5,1,2,3,4]) -> (1, 5)
   min_max_impl_1.MergeFrom(min_max_impl_3) -> (1, 9)
   min_max_impl_1.Finalize() -> (1, 9)
   ```
   
   I don't know why it's doing it this way; it seems rather inefficient to me. Evidently, though, for larger-scale workloads it *does* call `Consume()` more than once before merging and throwing away each `MinMaxImpl` instance.
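   
   By contrast, a hypothetical trace of the failing pattern (a single instance consuming more than one chunk before its state is merged) would look something like this:
   
   ```
   auto min_max_impl_1 = MinMaxImpl(...)
   min_max_impl_1.Consume([9,1,null,3,4]) -> (1, 9)
   min_max_impl_1.Consume([5,1,2,3,4])    -> (1, 5)   (earlier state discarded by the bug)
   min_max_impl_1.Finalize()              -> (1, 5)   (should be (1, 9))
   ```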




[GitHub] [arrow] github-actions[bot] commented on pull request #13509: ARROW-16904: [C++] min/max not deterministic if Parquet files have multiple row groups

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on PR #13509:
URL: https://github.com/apache/arrow/pull/13509#issuecomment-1175358081

   :warning: Ticket **has not been started in JIRA**, please click 'Start Progress'.




[GitHub] [arrow] drin commented on pull request #13509: ARROW-16904: [C++] min/max not deterministic if Parquet files have multiple row groups

Posted by GitBox <gi...@apache.org>.
drin commented on PR #13509:
URL: https://github.com/apache/arrow/pull/13509#issuecomment-1178292197

   Opened a draft PR to add to this one:
   
       https://github.com/jvanstraten/arrow/pull/5
   
   It looks a bit messy because of rebases; I'm not sure how to easily improve that.




[GitHub] [arrow] nealrichardson commented on a diff in pull request #13509: ARROW-16904: [C++] min/max not deterministic if Parquet files have multiple row groups

Posted by GitBox <gi...@apache.org>.
nealrichardson commented on code in PR #13509:
URL: https://github.com/apache/arrow/pull/13509#discussion_r917097216


##########
r/tests/testthat/test-dataset.R:
##########
@@ -618,6 +618,33 @@ test_that("UnionDataset handles InMemoryDatasets", {
   expect_equal(actual, expected)
 })
 
+test_that("scalar aggregates with many batches", {
+  test_data <- data.frame(val = 1:1e7)
+  expected_result_distr <- (
+    sapply(1:100, function(iter_ndx) {
+      test_data                              %>%
+        dplyr::summarise(min_val = min(val)) %>%
+        dplyr::collect()                     %>%
+        dplyr::pull(min_val)
+    }) %>%
+      table()
+  )
+
+  ds_tmpfile <- tempfile("test-aggregate", fileext = ".parquet")
+  arrow::write_parquet(test_data, ds_tmpfile)
+  actual_result_distr <- (
+    sapply(1:100, function(iter_ndx) {
+      arrow::open_dataset(ds_tmpfile)        %>%
+        dplyr::summarise(min_val = min(val)) %>%
+        dplyr::collect()                     %>%
+        dplyr::pull(min_val)
+    }) %>%
+      table()
+  )
+
+  expect_equal(actual_result_distr, expected_result_distr)
+})

Review Comment:
   Here's a simpler test (this fails on master but should pass with your fix)
   
   ```suggestion
   test_that("scalar aggregates with many batches (ARROW-16904)", {
     tf <- tempfile()
     write_parquet(data.frame(x = 1:100), tf, chunk_size = 20)
   
     ds <- open_dataset(tf)
     replicate(100, ds %>% summarize(min(x)) %>% pull())
   
     expect_true(all(replicate(100, ds %>% summarize(min(x)) %>% pull()) == 1))
     expect_true(all(replicate(100, ds %>% summarize(max(x)) %>% pull()) == 100))
   })
   ```





[GitHub] [arrow] lidavidm commented on pull request #13509: ARROW-16700: [C++][R][Datasets] aggregates on partitioning columns

Posted by GitBox <gi...@apache.org>.
lidavidm commented on PR #13509:
URL: https://github.com/apache/arrow/pull/13509#issuecomment-1175195775

   I suppose it's done that way because of the `// TODO: implement parallelism`, but I wonder if we want to just say "use Acero for that" or if we actually want to go implement parallelism.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] drin commented on a diff in pull request #13509: ARROW-16904: [C++] min/max not deterministic if Parquet files have multiple row groups

Posted by GitBox <gi...@apache.org>.
drin commented on code in PR #13509:
URL: https://github.com/apache/arrow/pull/13509#discussion_r917045351


##########
cpp/src/arrow/compute/exec/plan_test.cc:
##########
@@ -933,6 +933,38 @@ TEST(ExecPlanExecution, SourceGroupedSum) {
   }
 }
 
+TEST(ExecPlanExecution, SourceMinMaxScalar) {
+  for (bool parallel : { false, true }) {
+    SCOPED_TRACE(parallel ? "parallel/merged" : "serial");
+
+    auto input = MakeGroupableBatches(/*multiplicity=*/parallel ? 100 : 1);
+    auto minmax_opts = std::make_shared<ScalarAggregateOptions>();
+    auto expected_value = StructScalar::Make(

Review Comment:
   I tried looking at the function and searching for usages, but I couldn't figure it out. If you know how to do it, I can update it. I wasn't sure whether a StructScalar is essentially an object (e.g. `{ -8, 12}` would be a two-field struct).
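   
   If it helps, here is a rough, untested sketch of what I think is being suggested. It assumes `ScalarFromJSON` (from `arrow/testing/gtest_util.h`) accepts a struct type plus the object-style JSON from the linked `json_simple.cc` lines; the int32 field types and the `ExpectedMinMax` helper name are just guesses for illustration:
   
   ```cpp
   #include <memory>
   
   #include "arrow/api.h"
   #include "arrow/testing/gtest_util.h"
   
   // Build the expected StructScalar, i.e. {"min": -8, "max": 12}.
   std::shared_ptr<arrow::Scalar> ExpectedMinMax() {
     auto type = arrow::struct_({arrow::field("min", arrow::int32()),
                                 arrow::field("max", arrow::int32())});
     return arrow::ScalarFromJSON(type, R"({"min": -8, "max": 12})");
   }
   ```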





[GitHub] [arrow] drin commented on a diff in pull request #13509: ARROW-16904: [C++] min/max not deterministic if Parquet files have multiple row groups

Posted by GitBox <gi...@apache.org>.
drin commented on code in PR #13509:
URL: https://github.com/apache/arrow/pull/13509#discussion_r917100523


##########
r/tests/testthat/test-dataset.R:
##########
@@ -618,6 +618,33 @@ test_that("UnionDataset handles InMemoryDatasets", {
   expect_equal(actual, expected)
 })
 
+test_that("scalar aggregates with many batches", {
+  test_data <- data.frame(val = 1:1e7)
+  expected_result_distr <- (
+    sapply(1:100, function(iter_ndx) {
+      test_data                              %>%
+        dplyr::summarise(min_val = min(val)) %>%
+        dplyr::collect()                     %>%
+        dplyr::pull(min_val)
+    }) %>%
+      table()
+  )
+
+  ds_tmpfile <- tempfile("test-aggregate", fileext = ".parquet")
+  arrow::write_parquet(test_data, ds_tmpfile)
+  actual_result_distr <- (
+    sapply(1:100, function(iter_ndx) {
+      arrow::open_dataset(ds_tmpfile)        %>%
+        dplyr::summarise(min_val = min(val)) %>%
+        dplyr::collect()                     %>%
+        dplyr::pull(min_val)
+    }) %>%
+      table()
+  )
+
+  expect_equal(actual_result_distr, expected_result_distr)

Review Comment:
   The implementation was trying to make sure it always resolves to 1. I guess it could have been done in a loop instead of gathering everything into a table to be inspected at the end.
   
   I took Neal's suggestion, which simplifies the test body by using `replicate` and `all`.





[GitHub] [arrow] lidavidm merged pull request #13509: ARROW-16904: [C++] min/max not deterministic if Parquet files have multiple row groups

Posted by GitBox <gi...@apache.org>.
lidavidm merged PR #13509:
URL: https://github.com/apache/arrow/pull/13509




[GitHub] [arrow] lidavidm commented on pull request #13509: ARROW-16700: [C++][R][Datasets] aggregates on partitioning columns

Posted by GitBox <gi...@apache.org>.
lidavidm commented on PR #13509:
URL: https://github.com/apache/arrow/pull/13509#issuecomment-1175194160

   My bad, I was looking at the hash aggregate tests! 
   
   I suppose the 'right' way to test it is to construct an ExecPlan and feed the data through. There are some tests in plan_test.cc but it doesn't have much coverage of the kernels themselves. We may need some parameterization/a helper to test "both ways" of calling aggregates in much the same way hash_aggregate_test.cc does.




[GitHub] [arrow] westonpace commented on pull request #13509: ARROW-16700: [C++][R][Datasets] aggregates on partitioning columns

Posted by GitBox <gi...@apache.org>.
westonpace commented on PR #13509:
URL: https://github.com/apache/arrow/pull/13509#issuecomment-1175256838

   > I suppose the 'right' way to test it is to construct an ExecPlan and feed the data through. There are some tests in plan_test.cc but it doesn't have much coverage of the kernels themselves. We may need some parameterization/a helper to test "both ways" of calling aggregates in much the same way hash_aggregate_test.cc does.
   
   @drin is working on this for min/max.
   
   I think there is probably more interest/priority in making the exec plan case work well than in the chunked array case. If you are doing compute on chunked arrays and the entire chunked array fits in memory, it is probably sufficient to just concatenate the chunked array into a single array at the beginning of your compute work.
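   
   For the in-memory chunked-array case, a rough sketch of that workaround might look like the following. It assumes `arrow::Concatenate` and `arrow::compute::MinMax` with roughly these signatures, and `MinMaxOfChunked` is just an illustrative helper name:
   
   ```cpp
   #include <memory>
   
   #include "arrow/api.h"
   #include "arrow/array/concatenate.h"
   #include "arrow/compute/api.h"
   
   // Flatten the chunks up front, then run the scalar aggregate once over a
   // single contiguous array.
   arrow::Result<arrow::Datum> MinMaxOfChunked(
       const std::shared_ptr<arrow::ChunkedArray>& chunked) {
     ARROW_ASSIGN_OR_RAISE(auto flat, arrow::Concatenate(chunked->chunks()));
     return arrow::compute::MinMax(flat);
   }
   ```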




[GitHub] [arrow] drin commented on a diff in pull request #13509: ARROW-16904: [C++] min/max not deterministic if Parquet files have multiple row groups

Posted by GitBox <gi...@apache.org>.
drin commented on code in PR #13509:
URL: https://github.com/apache/arrow/pull/13509#discussion_r917044391


##########
cpp/src/arrow/compute/exec/plan_test.cc:
##########
@@ -933,6 +933,38 @@ TEST(ExecPlanExecution, SourceGroupedSum) {
   }
 }
 
+TEST(ExecPlanExecution, SourceMinMaxScalar) {

Review Comment:
   I haven't created such a task, but I was planning on following this up with one; then I can also reference this issue and PR.





[GitHub] [arrow] nealrichardson commented on a diff in pull request #13509: ARROW-16904: [C++] min/max not deterministic if Parquet files have multiple row groups

Posted by GitBox <gi...@apache.org>.
nealrichardson commented on code in PR #13509:
URL: https://github.com/apache/arrow/pull/13509#discussion_r917097216


##########
r/tests/testthat/test-dataset.R:
##########
@@ -618,6 +618,33 @@ test_that("UnionDataset handles InMemoryDatasets", {
   expect_equal(actual, expected)
 })
 
+test_that("scalar aggregates with many batches", {
+  test_data <- data.frame(val = 1:1e7)
+  expected_result_distr <- (
+    sapply(1:100, function(iter_ndx) {
+      test_data                              %>%
+        dplyr::summarise(min_val = min(val)) %>%
+        dplyr::collect()                     %>%
+        dplyr::pull(min_val)
+    }) %>%
+      table()
+  )
+
+  ds_tmpfile <- tempfile("test-aggregate", fileext = ".parquet")
+  arrow::write_parquet(test_data, ds_tmpfile)
+  actual_result_distr <- (
+    sapply(1:100, function(iter_ndx) {
+      arrow::open_dataset(ds_tmpfile)        %>%
+        dplyr::summarise(min_val = min(val)) %>%
+        dplyr::collect()                     %>%
+        dplyr::pull(min_val)
+    }) %>%
+      table()
+  )
+
+  expect_equal(actual_result_distr, expected_result_distr)
+})

Review Comment:
   Here's a simpler test--this fails on master (actually, the max is fine, the min is the problem) but should pass with your fix
   
   ```suggestion
   test_that("scalar aggregates with many batches (ARROW-16904)", {
     tf <- tempfile()
     write_parquet(data.frame(x = 1:100), tf, chunk_size = 20)
   
     ds <- open_dataset(tf)
     replicate(100, ds %>% summarize(min(x)) %>% pull())
   
     expect_true(all(replicate(100, ds %>% summarize(min(x)) %>% pull()) == 1))
     expect_true(all(replicate(100, ds %>% summarize(max(x)) %>% pull()) == 100))
   })
   ```


