You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/06/03 12:16:54 UTC

[GitHub] [arrow] Crystrix opened a new pull request #10443: ARROW-12942: [C++][Compute] Fix incorrect result of Arrow compute hash_min_max with chunked array

Crystrix opened a new pull request #10443:
URL: https://github.com/apache/arrow/pull/10443


   If there are new groups in the subsequent chunks of a chunked array, the result of Arrow compute hash_min_max is incorrect.
   For example, a table with two chunks, the second chunk has a new group key
   ```
   First chunk: {"argument": 1, "key": 0},
   Second chunk: {"argument": 0,  "key": 1}
   ```
   the result of hash_min_max by "key" with such data is
   ```
   [{"min": null, "max": null}, 0],
   [{"min": 0, "max": 0}, 1]
   ```
   But it should be
   ```
   [{"min": 1, "max": 1}, 0],
   [{"min": 0, "max": 0}, 1]
   ```
   
   The root cause is that `has_values_` and `has_nulls_` are `BufferBuilder` which has no `_size` and `capacity_` property.  So `MakeResizeImpl` function init a `TypedBufferBuilder` with the `BufferBuilder` with `_size` and  `capacity_` of 0. After the first chunk is processed, in the consumption of the second chunk,  `MakeResizeImpl` is called to reserve enough space for the next chunk. Then as the `_size` and  `capacity_` are zero, the original `BufferBuilder` is overwritten by `Reserve`, and outputs an incorrect result.
   
   This MR separates `has_values_` and `has_nulls_` with a `TypedBufferBuilder<bool>` which can keep the `_size` and `capacity_` property. Then in the consumption of the second chunk, the space of `has_values_` and `has_nulls_` is reserved after the data of the first chunk.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] bkietz commented on a change in pull request #10443: ARROW-12942: [C++][Compute] Fix incorrect result of Arrow compute hash_min_max with a chunked array

Posted by GitBox <gi...@apache.org>.
bkietz commented on a change in pull request #10443:
URL: https://github.com/apache/arrow/pull/10443#discussion_r646816751



##########
File path: cpp/src/arrow/compute/kernels/hash_aggregate.cc
##########
@@ -1000,16 +1010,16 @@ struct GroupedMinMaxImpl : public GroupedAggregator {
 
     mins_ = BufferBuilder(ctx->memory_pool());
     maxes_ = BufferBuilder(ctx->memory_pool());
-    has_values_ = BufferBuilder(ctx->memory_pool());
-    has_nulls_ = BufferBuilder(ctx->memory_pool());
+    has_values_ = TypedBufferBuilder<bool>(ctx->memory_pool());
+    has_nulls_ = TypedBufferBuilder<bool>(ctx->memory_pool());
 
     GetImpl get_impl;
     RETURN_NOT_OK(VisitTypeInline(*input_type, &get_impl));
 
     consume_impl_ = std::move(get_impl.consume_impl);
     resize_min_impl_ = std::move(get_impl.resize_min_impl);
     resize_max_impl_ = std::move(get_impl.resize_max_impl);
-    resize_bitmap_impl_ = MakeResizeImpl(false);
+    resize_bitmap_impl_ = MakeResizeImplForBitmap(false);

Review comment:
       I think it's unnecessary to have a separate `std::function` for resizing the bitmaps, let's just linline the calls to TypedBufferBuilder::Append inside the MaybeReserve call




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] bkietz closed pull request #10443: ARROW-12942: [C++][Compute] Fix incorrect result of Arrow compute hash_min_max with a chunked array

Posted by GitBox <gi...@apache.org>.
bkietz closed pull request #10443:
URL: https://github.com/apache/arrow/pull/10443


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] github-actions[bot] commented on pull request #10443: ARROW-12942: [C++][Compute] Fix incorrect result of Arrow compute hash_min_max with a chunked array

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #10443:
URL: https://github.com/apache/arrow/pull/10443#issuecomment-853824476


   https://issues.apache.org/jira/browse/ARROW-12942


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] github-actions[bot] commented on pull request #10443: ARROW-12942: [C++][Compute] Fix incorrect result of Arrow compute hash_min_max with a chunked array

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #10443:
URL: https://github.com/apache/arrow/pull/10443#issuecomment-853824476


   https://issues.apache.org/jira/browse/ARROW-12942


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] Crystrix commented on a change in pull request #10443: ARROW-12942: [C++][Compute] Fix incorrect result of Arrow compute hash_min_max with a chunked array

Posted by GitBox <gi...@apache.org>.
Crystrix commented on a change in pull request #10443:
URL: https://github.com/apache/arrow/pull/10443#discussion_r647126649



##########
File path: cpp/src/arrow/compute/kernels/hash_aggregate.cc
##########
@@ -1000,16 +1010,16 @@ struct GroupedMinMaxImpl : public GroupedAggregator {
 
     mins_ = BufferBuilder(ctx->memory_pool());
     maxes_ = BufferBuilder(ctx->memory_pool());
-    has_values_ = BufferBuilder(ctx->memory_pool());
-    has_nulls_ = BufferBuilder(ctx->memory_pool());
+    has_values_ = TypedBufferBuilder<bool>(ctx->memory_pool());
+    has_nulls_ = TypedBufferBuilder<bool>(ctx->memory_pool());
 
     GetImpl get_impl;
     RETURN_NOT_OK(VisitTypeInline(*input_type, &get_impl));
 
     consume_impl_ = std::move(get_impl.consume_impl);
     resize_min_impl_ = std::move(get_impl.resize_min_impl);
     resize_max_impl_ = std::move(get_impl.resize_max_impl);
-    resize_bitmap_impl_ = MakeResizeImpl(false);
+    resize_bitmap_impl_ = MakeResizeImplForBitmap(false);

Review comment:
       Sure, now `resize_bitmap_impl_` is replaced with `Append` in `MaybeReserve`




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org