You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/07/05 15:06:01 UTC

[GitHub] [arrow] jvanstraten commented on pull request #13509: ARROW-16700: [C++][R][Datasets] aggregates on partitioning columns

jvanstraten commented on PR #13509:
URL: https://github.com/apache/arrow/pull/13509#issuecomment-1175169931

   Chunked arrays are tested [here](https://github.com/apache/arrow/blob/3d6240c1ee7802829d2ed209f4135906e9413915/cpp/src/arrow/compute/kernels/aggregate_test.cc#L1566-L1568), but, as cleaned up from my debug prints for `chunked_input1`, the call pattern there is
   
   ```
   auto min_max_impl_1 = MinMaxImpl(...)
   auto min_max_impl_2 = MinMaxImpl(...)
   min_max_impl_2.Consume([5,1,2,3,4]) -> (1, 5)
   min_max_impl_1.MergeFrom(min_max_impl_2) -> (1, 5)
   auto min_max_impl_3 = MinMaxImpl(...)
   min_max_impl_3.Consume([9,1,null,3,4]) -> (1, 9)
   min_max_impl_1.MergeFrom(min_max_impl_3) -> (1, 9)
   min_max_impl_1.Finalize() -> (1, 9)
   ```
   
   for which it does not matter whether `Consume()` overrides the previous state. The test cases aren't great anyway since the last chunk has the min and max values for each of them, but even if I'd swap one of them around you'd still get
   
   ```
   auto min_max_impl_1 = MinMaxImpl(...)
   auto min_max_impl_2 = MinMaxImpl(...)
   min_max_impl_2.Consume([9,1,null,3,4]) -> (1, 9)
   min_max_impl_1.MergeFrom(min_max_impl_2) -> (1, 9)
   auto min_max_impl_3 = MinMaxImpl(...)
   min_max_impl_3.Consume([5,1,2,3,4]) -> (1, 5)
   min_max_impl_1.MergeFrom(min_max_impl_3) -> (1, 9)
   min_max_impl_1.Finalize() -> (1, 9)
   ```
   
   I don't know why it's doing it this way. Seems rather inefficient to me. Evidently though, for larger-scale workloads it *does* call `Consume()` more than once before merging and throwing away each `MinMaxImpl` instance.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org