Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/02/08 14:13:20 UTC

[GitHub] [arrow] dhruv9vats opened a new pull request #12368: ARROW-13993: [C++] [Compute] Add hash_one aggregate function

dhruv9vats opened a new pull request #12368:
URL: https://github.com/apache/arrow/pull/12368


   Add a hash aggregate function that returns one value from each group.
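   (A rough usage sketch, mirroring the `internal::GroupBy` test-helper calls that appear later in this thread:)
   ```cpp
   // Group rows of `table` by "key" and return one arbitrary "argument" value
   // per group; which value is returned is unspecified when batches are
   // consumed in parallel.
   ASSERT_OK_AND_ASSIGN(Datum aggregated,
                        internal::GroupBy({table->GetColumnByName("argument")},
                                          {table->GetColumnByName("key")},
                                          {{"hash_one", nullptr}},
                                          /*use_threads=*/false));
   ```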


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] lidavidm closed pull request #12368: ARROW-13993: [C++] [Compute] Add hash_one aggregate function

Posted by GitBox <gi...@apache.org>.
lidavidm closed pull request #12368:
URL: https://github.com/apache/arrow/pull/12368


   





[GitHub] [arrow] dhruv9vats commented on a change in pull request #12368: ARROW-13993: [C++] [Compute] Add hash_one aggregate function

Posted by GitBox <gi...@apache.org>.
dhruv9vats commented on a change in pull request #12368:
URL: https://github.com/apache/arrow/pull/12368#discussion_r804579744



##########
File path: cpp/src/arrow/compute/kernels/hash_aggregate_test.cc
##########
@@ -2460,6 +2460,476 @@ TEST(GroupBy, Distinct) {
   }
 }
 
+TEST(GroupBy, One) {

Review comment:
       This test can probably be deleted, as the tests below cover the same cases.

##########
File path: cpp/src/arrow/compute/kernels/hash_aggregate_test.cc
##########
@@ -2460,6 +2460,476 @@ TEST(GroupBy, Distinct) {
   }
 }
 
+TEST(GroupBy, One) {
+  {
+    auto table =
+        TableFromJSON(schema({field("argument", int64()), field("key", int64())}), {R"([
+    [99,  1],
+    [99,  1]
+])",
+                                                                                    R"([
+    [77,  2],
+    [null,   3],
+    [null,   3]
+])",
+                                                                                    R"([
+    [null,   4],
+    [null,   4]
+])",
+                                                                                  R"([
+    [88,  null],
+    [99,  3]
+])",
+                                                                                  R"([
+    [77,  2],
+    [76, 2]
+])",
+                                                                                  R"([
+    [75, null],
+    [74,  3]
+  ])",
+                                                                                  R"([
+    [73,    null],
+    [72,    null]
+  ])"});
+
+  ASSERT_OK_AND_ASSIGN(auto aggregated_and_grouped,
+                       internal::GroupBy(
+                           {
+                               table->GetColumnByName("argument"),
+                           },
+                           {
+                               table->GetColumnByName("key"),
+                           },
+                           {
+                               {"hash_one", nullptr},
+                           },
+                           false));
+  ValidateOutput(aggregated_and_grouped);
+  SortBy({"key_0"}, &aggregated_and_grouped);
+
+  AssertDatumsEqual(ArrayFromJSON(struct_({
+                                      field("hash_one", int64()),
+                                      field("key_0", int64()),
+                                  }),
+                                  R"([
+      [99, 1],
+      [77, 2],
+      [null,  3],
+      [null,  4],
+      [88, null]
+    ])"),
+                    aggregated_and_grouped,
+                    /*verbose=*/true);
+  }
+  {
+    auto table =
+        TableFromJSON(schema({field("argument", utf8()), field("key", int64())}), {R"([
+     ["foo",  1],
+     ["foo",  1]
+ ])",
+                                                                                   R"([
+     ["bar",  2],
+     [null,   3],
+     [null,   3]
+ ])",
+                                                                                   R"([
+     [null,   4],
+     [null,   4]
+ ])",
+                                                                                   R"([
+     ["baz",  null],
+     ["foo",  3]
+ ])",
+                                                                                   R"([
+     ["bar",  2],
+     ["spam", 2]
+ ])",
+                                                                                   R"([
+     ["eggs", null],
+     ["ham",  3]
+   ])",
+                                                                                   R"([
+     ["a",    null],
+     ["b",    null]
+   ])"});
+
+    ASSERT_OK_AND_ASSIGN(auto aggregated_and_grouped,
+                         internal::GroupBy(
+                             {
+                                 table->GetColumnByName("argument"),
+                             },
+                             {
+                                 table->GetColumnByName("key"),
+                             },
+                             {
+                                 {"hash_one", nullptr},
+                             },
+                             false));
+    ValidateOutput(aggregated_and_grouped);
+    SortBy({"key_0"}, &aggregated_and_grouped);
+
+    AssertDatumsEqual(ArrayFromJSON(struct_({
+                                        field("hash_one", utf8()),
+                                        field("key_0", int64()),
+                                    }),
+                                    R"([
+       ["foo", 1],
+       ["bar", 2],
+       [null,  3],
+       [null,  4],
+       ["baz", null]
+     ])"),
+                      aggregated_and_grouped,
+                      /*verbose=*/true);
+  }
+}
+
+TEST(GroupBy, OneOnly) {
+  auto in_schema = schema({
+      field("argument0", float64()),
+      field("argument1", null()),
+      field("argument2", boolean()),
+      field("key", int64()),
+  });
+  for (bool use_exec_plan : {false, true}) {
+    for (bool use_threads : {false}) {
+      SCOPED_TRACE(use_threads ? "parallel/merged" : "serial");
+
+      auto table = TableFromJSON(in_schema, {R"([
+    [1.0,   null, true, 1],
+    [null,  null, true, 1]
+])",
+                                             R"([
+    [0.0,   null, false, 2],
+    [null,  null, false, 3],
+    [4.0,   null, null,  null],
+    [3.25,  null, true,  1],
+    [0.125, null, false, 2]
+])",
+                                             R"([
+    [-0.25, null, false, 2],
+    [0.75,  null, true,  null],
+    [null,  null, true,  3]
+])"});
+
+      ASSERT_OK_AND_ASSIGN(Datum aggregated_and_grouped,
+                           GroupByTest(
+                               {
+                                   table->GetColumnByName("argument0"),
+                                   table->GetColumnByName("argument1"),
+                                   table->GetColumnByName("argument2"),
+                               },
+                               {table->GetColumnByName("key")},
+                               {
+                                   {"hash_one", nullptr},
+                                   {"hash_one", nullptr},
+                                   {"hash_one", nullptr},
+                               },
+                               use_threads, use_exec_plan));
+      ValidateOutput(aggregated_and_grouped);
+      SortBy({"key_0"}, &aggregated_and_grouped);
+
+      AssertDatumsEqual(ArrayFromJSON(struct_({
+                                          field("hash_one", float64()),
+                                          field("hash_one", null()),
+                                          field("hash_one", boolean()),
+                                          field("key_0", int64()),
+                                      }),
+                                      R"([
+    [1.0,  null, true,  1],
+    [0.0,  null, false, 2],
+    [null, null, false, 3],
+    [4.0,  null, null,  null]
+  ])"),
+                        aggregated_and_grouped,
+                        /*verbose=*/true);
+    }
+  }
+}
+
+TEST(GroupBy, OneTypes) {
+  std::vector<std::shared_ptr<DataType>> types;
+  types.insert(types.end(), NumericTypes().begin(), NumericTypes().end());
+  types.insert(types.end(), TemporalTypes().begin(), TemporalTypes().end());
+  types.push_back(month_interval());
+
+  const std::vector<std::string> default_table = {R"([
+    [1,    1],
+    [null, 1]
+])",
+                                                  R"([
+    [0,    2],
+    [null, 3],
+    [3,    4],
+    [5,    4],
+    [4,    null],
+    [3,    1],
+    [0,    2]
+])",
+                                                  R"([
+    [0,    2],
+    [1,    null],
+    [null, 3]
+])"};
+
+  const std::vector<std::string> date64_table = {R"([
+    [86400000,  1],
+    [null,      1]
+])",
+                                                 R"([
+    [0,         2],
+    [null,      3],
+    [259200000, 4],
+    [432000000, 4],
+    [345600000, null],
+    [259200000, 1],
+    [0,         2]
+])",
+                                                 R"([
+    [0,         2],
+    [86400000,  null],
+    [null,      3]
+])"};
+
+  const std::string default_expected =
+      R"([
+    [1,    1],
+    [0,    2],
+    [null, 3],
+    [3,    4],
+    [4,    null]
+    ])";
+
+  const std::string date64_expected =
+      R"([
+    [86400000,  1],
+    [0,         2],
+    [null,      3],
+    [259200000, 4],
+    [345600000, null]
+    ])";
+
+  for (const auto& ty : types) {
+    SCOPED_TRACE(ty->ToString());
+    auto in_schema = schema({field("argument0", ty), field("key", int64())});
+    auto table =
+        TableFromJSON(in_schema, (ty->name() == "date64") ? date64_table : default_table);
+
+    ASSERT_OK_AND_ASSIGN(
+        Datum aggregated_and_grouped,
+        GroupByTest({table->GetColumnByName("argument0")},
+                    {table->GetColumnByName("key")}, {{"hash_one", nullptr}},
+                    /*use_threads=*/false, /*use_exec_plan=*/true));
+    ValidateOutput(aggregated_and_grouped);
+    SortBy({"key_0"}, &aggregated_and_grouped);
+
+    AssertDatumsEqual(
+        ArrayFromJSON(struct_({
+                          field("hash_one", ty),
+                          field("key_0", int64()),
+                      }),
+                      (ty->name() == "date64") ? date64_expected : default_expected),
+        aggregated_and_grouped,
+        /*verbose=*/true);
+  }
+}
+
+TEST(GroupBy, OneDecimal) {
+  auto in_schema = schema({
+      field("argument0", decimal128(3, 2)),
+      field("argument1", decimal256(3, 2)),
+      field("key", int64()),
+  });
+  for (bool use_exec_plan : {false, true}) {
+    for (bool use_threads : {/*true, */ false}) {

Review comment:
       The `true` option for `use_threads` is commented out in all these tests because they run in parallel, and the way `hash_one` is currently implemented, when it encounters a value (null or otherwise) for a group for the first time, it stores that value and never updates it.
   For example, in the table below,
   - if we go serially, `["4.01", "4.01", null]` is the first row encountered whose key is `null`, but
   - if we go in parallel, `["0.75", "0.75", null]` may be the first row encountered whose key is `null`.
   
   While this is expected behavior, it is tricky to assert in a single pass. So should we test explicitly for both `use_threads = true` and `use_threads = false`?
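   
   (Reduced to a sketch, this is the per-group update rule from the `Consume` implementation quoted below; the first value consumed for a group wins:)
   ```cpp
   // First-write-wins: once group `g` has been visited, any later value for
   // `g` (from any batch or thread) is ignored, so the surviving value
   // depends entirely on consume order.
   if (!bit_util::GetBit(has_one_.data(), g)) {
     GetSet::Set(raw_ones_, g, val);                // remember the first value
     bit_util::SetBit(has_one_.mutable_data(), g);  // mark the group decided
   }
   ```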

##########
File path: cpp/src/arrow/compute/kernels/hash_aggregate_test.cc
##########
@@ -2460,6 +2460,476 @@ TEST(GroupBy, Distinct) {
   }
 }
 
+TEST(GroupBy, One) {
+  {
+    auto table =
+        TableFromJSON(schema({field("argument", int64()), field("key", int64())}), {R"([
+    [99,  1],
+    [99,  1]
+])",
+                                                                                    R"([
+    [77,  2],
+    [null,   3],
+    [null,   3]
+])",
+                                                                                    R"([
+    [null,   4],
+    [null,   4]
+])",
+                                                                                  R"([
+    [88,  null],
+    [99,  3]
+])",
+                                                                                  R"([
+    [77,  2],
+    [76, 2]
+])",
+                                                                                  R"([
+    [75, null],
+    [74,  3]
+  ])",
+                                                                                  R"([
+    [73,    null],
+    [72,    null]
+  ])"});
+
+  ASSERT_OK_AND_ASSIGN(auto aggregated_and_grouped,
+                       internal::GroupBy(
+                           {
+                               table->GetColumnByName("argument"),
+                           },
+                           {
+                               table->GetColumnByName("key"),
+                           },
+                           {
+                               {"hash_one", nullptr},
+                           },
+                           false));
+  ValidateOutput(aggregated_and_grouped);
+  SortBy({"key_0"}, &aggregated_and_grouped);
+
+  AssertDatumsEqual(ArrayFromJSON(struct_({
+                                      field("hash_one", int64()),
+                                      field("key_0", int64()),
+                                  }),
+                                  R"([
+      [99, 1],
+      [77, 2],
+      [null,  3],
+      [null,  4],
+      [88, null]
+    ])"),
+                    aggregated_and_grouped,
+                    /*verbose=*/true);
+  }
+  {
+    auto table =
+        TableFromJSON(schema({field("argument", utf8()), field("key", int64())}), {R"([
+     ["foo",  1],
+     ["foo",  1]
+ ])",
+                                                                                   R"([
+     ["bar",  2],
+     [null,   3],
+     [null,   3]
+ ])",
+                                                                                   R"([
+     [null,   4],
+     [null,   4]
+ ])",
+                                                                                   R"([
+     ["baz",  null],
+     ["foo",  3]
+ ])",
+                                                                                   R"([
+     ["bar",  2],
+     ["spam", 2]
+ ])",
+                                                                                   R"([
+     ["eggs", null],
+     ["ham",  3]
+   ])",
+                                                                                   R"([
+     ["a",    null],
+     ["b",    null]
+   ])"});
+
+    ASSERT_OK_AND_ASSIGN(auto aggregated_and_grouped,
+                         internal::GroupBy(
+                             {
+                                 table->GetColumnByName("argument"),
+                             },
+                             {
+                                 table->GetColumnByName("key"),
+                             },
+                             {
+                                 {"hash_one", nullptr},
+                             },
+                             false));
+    ValidateOutput(aggregated_and_grouped);
+    SortBy({"key_0"}, &aggregated_and_grouped);
+
+    AssertDatumsEqual(ArrayFromJSON(struct_({
+                                        field("hash_one", utf8()),
+                                        field("key_0", int64()),
+                                    }),
+                                    R"([
+       ["foo", 1],
+       ["bar", 2],
+       [null,  3],
+       [null,  4],
+       ["baz", null]
+     ])"),
+                      aggregated_and_grouped,
+                      /*verbose=*/true);
+  }
+}
+
+TEST(GroupBy, OneOnly) {
+  auto in_schema = schema({
+      field("argument0", float64()),
+      field("argument1", null()),
+      field("argument2", boolean()),
+      field("key", int64()),
+  });
+  for (bool use_exec_plan : {false, true}) {
+    for (bool use_threads : {false}) {
+      SCOPED_TRACE(use_threads ? "parallel/merged" : "serial");
+
+      auto table = TableFromJSON(in_schema, {R"([
+    [1.0,   null, true, 1],
+    [null,  null, true, 1]
+])",
+                                             R"([
+    [0.0,   null, false, 2],
+    [null,  null, false, 3],
+    [4.0,   null, null,  null],
+    [3.25,  null, true,  1],
+    [0.125, null, false, 2]
+])",
+                                             R"([
+    [-0.25, null, false, 2],
+    [0.75,  null, true,  null],
+    [null,  null, true,  3]
+])"});
+
+      ASSERT_OK_AND_ASSIGN(Datum aggregated_and_grouped,
+                           GroupByTest(
+                               {
+                                   table->GetColumnByName("argument0"),
+                                   table->GetColumnByName("argument1"),
+                                   table->GetColumnByName("argument2"),
+                               },
+                               {table->GetColumnByName("key")},
+                               {
+                                   {"hash_one", nullptr},
+                                   {"hash_one", nullptr},
+                                   {"hash_one", nullptr},
+                               },
+                               use_threads, use_exec_plan));
+      ValidateOutput(aggregated_and_grouped);
+      SortBy({"key_0"}, &aggregated_and_grouped);
+
+      AssertDatumsEqual(ArrayFromJSON(struct_({
+                                          field("hash_one", float64()),
+                                          field("hash_one", null()),
+                                          field("hash_one", boolean()),
+                                          field("key_0", int64()),
+                                      }),
+                                      R"([
+    [1.0,  null, true,  1],
+    [0.0,  null, false, 2],
+    [null, null, false, 3],
+    [4.0,  null, null,  null]
+  ])"),
+                        aggregated_and_grouped,
+                        /*verbose=*/true);
+    }
+  }
+}
+
+TEST(GroupBy, OneTypes) {

Review comment:
       Most of these tests are carried over from the MinMax tests with appropriate modifications. Is that okay, or should more tests be added?

##########
File path: cpp/src/arrow/compute/kernels/hash_aggregate.cc
##########
@@ -2451,6 +2451,333 @@ Result<std::unique_ptr<KernelState>> GroupedDistinctInit(KernelContext* ctx,
   return std::move(impl);
 }
 
+// ----------------------------------------------------------------------
+// One implementation
+
+template <typename Type, typename Enable = void>
+struct GroupedOneImpl final : public GroupedAggregator {
+  using CType = typename TypeTraits<Type>::CType;
+  using GetSet = GroupedValueTraits<Type>;
+
+  Status Init(ExecContext* ctx, const std::vector<ValueDescr>&,
+              const FunctionOptions* options) override {
+    // out_type_ initialized by GroupedOneInit
+    ones_ = TypedBufferBuilder<CType>(ctx->memory_pool());
+    has_one_ = TypedBufferBuilder<bool>(ctx->memory_pool());
+    has_value_ = TypedBufferBuilder<bool>(ctx->memory_pool());
+    return Status::OK();
+  }
+
+  Status Resize(int64_t new_num_groups) override {
+    auto added_groups = new_num_groups - num_groups_;
+    num_groups_ = new_num_groups;
+    RETURN_NOT_OK(ones_.Append(added_groups, static_cast<CType>(0)));
+    RETURN_NOT_OK(has_one_.Append(added_groups, false));
+    RETURN_NOT_OK(has_value_.Append(added_groups, false));
+    return Status::OK();
+  }
+
+  Status Consume(const ExecBatch& batch) override {
+    auto raw_ones_ = ones_.mutable_data();
+
+    return VisitGroupedValues<Type>(
+        batch,
+        [&](uint32_t g, CType val) -> Status {
+          if (!bit_util::GetBit(has_one_.data(), g)) {
+            GetSet::Set(raw_ones_, g, val);
+            bit_util::SetBit(has_one_.mutable_data(), g);
+            bit_util::SetBit(has_value_.mutable_data(), g);
+          }
+          return Status::OK();
+        },
+        [&](uint32_t g) -> Status {
+          bit_util::SetBit(has_one_.mutable_data(), g);
+          return Status::OK();
+        });
+  }
+
+  Status Merge(GroupedAggregator&& raw_other,
+               const ArrayData& group_id_mapping) override {
+    auto other = checked_cast<GroupedOneImpl*>(&raw_other);
+
+    auto raw_ones = ones_.mutable_data();
+    auto other_raw_ones = other->ones_.mutable_data();
+
+    auto g = group_id_mapping.GetValues<uint32_t>(1);
+    for (uint32_t other_g = 0; static_cast<int64_t>(other_g) < group_id_mapping.length;
+         ++other_g, ++g) {
+      if (!bit_util::GetBit(has_one_.data(), *g)) {
+        if (bit_util::GetBit(other->has_value_.data(), other_g)) {
+          GetSet::Set(raw_ones, *g, GetSet::Get(other_raw_ones, other_g));
+          bit_util::SetBit(has_value_.mutable_data(), *g);
+        }
+        bit_util::SetBit(has_one_.mutable_data(), *g);
+      }
+    }
+
+    return Status::OK();
+  }
+
+  Result<Datum> Finalize() override {
+    ARROW_ASSIGN_OR_RAISE(auto null_bitmap, has_value_.Finish());
+    ARROW_ASSIGN_OR_RAISE(auto data, ones_.Finish());
+    return ArrayData::Make(out_type_, num_groups_,
+                           {std::move(null_bitmap), std::move(data)});
+  }
+
+  std::shared_ptr<DataType> out_type() const override { return out_type_; }
+
+  int64_t num_groups_;
+  TypedBufferBuilder<CType> ones_;
+  TypedBufferBuilder<bool> has_one_, has_value_;
+  std::shared_ptr<DataType> out_type_;
+};
+
+struct GroupedNullOneImpl : public GroupedAggregator {
+  Status Init(ExecContext* ctx, const std::vector<ValueDescr>&,
+              const FunctionOptions* options) override {
+    return Status::OK();
+  }
+
+  Status Resize(int64_t new_num_groups) override {
+    num_groups_ = new_num_groups;
+    return Status::OK();
+  }
+
+  Status Consume(const ExecBatch& batch) override { return Status::OK(); }
+
+  Status Merge(GroupedAggregator&& raw_other,
+               const ArrayData& group_id_mapping) override {
+    return Status::OK();
+  }
+
+  Result<Datum> Finalize() override {
+    return ArrayData::Make(null(), num_groups_, {nullptr}, num_groups_);
+  }
+
+  std::shared_ptr<DataType> out_type() const override { return null(); }
+
+  int64_t num_groups_;
+};
+
+template <typename Type>
+struct GroupedOneImpl<Type, enable_if_t<is_base_binary_type<Type>::value ||
+                                        std::is_same<Type, FixedSizeBinaryType>::value>>
+    final : public GroupedAggregator {
+  using Allocator = arrow::stl::allocator<char>;
+  using StringType = std::basic_string<char, std::char_traits<char>, Allocator>;
+
+  Status Init(ExecContext* ctx, const std::vector<ValueDescr>&,
+              const FunctionOptions* options) override {
+    ctx_ = ctx;
+    allocator_ = Allocator(ctx->memory_pool());
+    // out_type_ initialized by GroupedOneInit
+    has_value_ = TypedBufferBuilder<bool>(ctx->memory_pool());
+    return Status::OK();
+  }
+
+  Status Resize(int64_t new_num_groups) override {
+    auto added_groups = new_num_groups - num_groups_;
+    DCHECK_GE(added_groups, 0);
+    num_groups_ = new_num_groups;
+    ones_.resize(new_num_groups);
+    RETURN_NOT_OK(has_one_.Append(added_groups, false));
+    RETURN_NOT_OK(has_value_.Append(added_groups, false));
+    return Status::OK();
+  }
+
+  Status Consume(const ExecBatch& batch) override {
+    return VisitGroupedValues<Type>(
+        batch,
+        [&](uint32_t g, util::string_view val) -> Status {
+          if (!bit_util::GetBit(has_one_.data(), g)) {
+            ones_[g].emplace(val.data(), val.size(), allocator_);
+            bit_util::SetBit(has_one_.mutable_data(), g);
+            bit_util::SetBit(has_value_.mutable_data(), g);
+          }
+          return Status::OK();
+        },
+        [&](uint32_t g) -> Status {
+          // as has_one_ is set, has_value_ will never be set, resulting in null
+          bit_util::SetBit(has_one_.mutable_data(), g);
+          return Status::OK();
+        });
+  }
+
+  Status Merge(GroupedAggregator&& raw_other,
+               const ArrayData& group_id_mapping) override {
+    auto other = checked_cast<GroupedOneImpl*>(&raw_other);
+    auto g = group_id_mapping.GetValues<uint32_t>(1);
+    for (uint32_t other_g = 0; static_cast<int64_t>(other_g) < group_id_mapping.length;
+         ++other_g, ++g) {
+      if (!bit_util::GetBit(has_one_.data(), *g)) {
+        if (bit_util::GetBit(other->has_value_.data(), other_g)) {
+          ones_[*g] = std::move(other->ones_[other_g]);
+          bit_util::SetBit(has_value_.mutable_data(), *g);
+        }
+        bit_util::SetBit(has_one_.mutable_data(), *g);
+      }
+    }
+    return Status::OK();
+  }
+
+  Result<Datum> Finalize() override {
+    ARROW_ASSIGN_OR_RAISE(auto null_bitmap, has_value_.Finish());
+    auto ones =
+        ArrayData::Make(out_type(), num_groups_, {std::move(null_bitmap), nullptr});
+    RETURN_NOT_OK(MakeOffsetsValues(ones.get(), ones_));
+    return ones;
+  }
+
+  template <typename T = Type>
+  enable_if_base_binary<T, Status> MakeOffsetsValues(

Review comment:
       These overloaded `MakeOffsetsValues` methods are identical between `GroupedOneImpl` and `GroupedMinMaxImpl`; should we consider extracting them into shared helper functions?
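   
   (For illustration, a rough sketch of what such a shared helper could look like; the name, signature, and body here are made up for this comment, not taken from the PR:)
   ```cpp
   // Hypothetical free-function form of the duplicated member: fill the
   // offsets (buffer 1) and values (buffer 2) of a binary-like ArrayData
   // from the per-group strings, so both aggregators can delegate to it.
   template <typename Type, typename StringType>
   enable_if_base_binary<Type, Status> MakeOffsetsValuesHelper(
       ArrayData* array, const std::vector<util::optional<StringType>>& values,
       MemoryPool* pool) {
     using offset_type = typename Type::offset_type;
     TypedBufferBuilder<offset_type> offset_builder(pool);
     BufferBuilder value_builder(pool);
     offset_type offset = 0;
     RETURN_NOT_OK(offset_builder.Append(offset));
     for (const auto& value : values) {
       if (value.has_value()) {
         RETURN_NOT_OK(value_builder.Append(value->data(), value->size()));
         offset += static_cast<offset_type>(value->size());
       }
       // Null slots get a zero-length entry; validity comes from buffer 0.
       RETURN_NOT_OK(offset_builder.Append(offset));
     }
     array->buffers.resize(3);
     RETURN_NOT_OK(offset_builder.Finish(&array->buffers[1]));
     return value_builder.Finish(&array->buffers[2]);
   }
   ```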







[GitHub] [arrow] dhruv9vats commented on a change in pull request #12368: ARROW-13993: [C++] [Compute] Add hash_one aggregate function

Posted by GitBox <gi...@apache.org>.
dhruv9vats commented on a change in pull request #12368:
URL: https://github.com/apache/arrow/pull/12368#discussion_r808498087



##########
File path: cpp/src/arrow/compute/kernels/hash_aggregate_test.cc
##########
@@ -2460,6 +2461,294 @@ TEST(GroupBy, Distinct) {
   }
 }
 
+TEST(GroupBy, OneMiscTypes) {
+  auto in_schema = schema({
+      field("floats", float64()),
+      field("nulls", null()),
+      field("booleans", boolean()),
+      field("decimal128", decimal128(3, 2)),
+      field("decimal256", decimal256(3, 2)),
+      field("fixed_binary", fixed_size_binary(3)),
+      field("key", int64()),
+  });
+  for (bool use_exec_plan : {true, false}) {
+    for (bool use_threads : {true, false}) {
+      SCOPED_TRACE(use_threads ? "parallel/merged" : "serial");
+
+      auto table = TableFromJSON(in_schema, {R"([
+    [null, null, true,   null,    null,    null,  1],
+    [1.0,  null, true,   "1.01",  "1.01",  "aaa", 1]
+])",
+                                             R"([
+    [0.0,   null, false, "0.00",  "0.00",  "bac", 2],
+    [null,  null, false, null,    null,    null,  3],
+    [4.0,   null, null,  "4.01",  "4.01",  "234", null],
+    [3.25,  null, true,  "3.25",  "3.25",  "ddd", 1],
+    [0.125, null, false, "0.12",  "0.12",  "bcd", 2]
+])",
+                                             R"([
+    [-0.25, null, false, "-0.25", "-0.25", "bab", 2],
+    [0.75,  null, true,  "0.75",  "0.75",  "123", null],
+    [null,  null, true,  null,    null,    null,  3]
+])"});
+
+      ASSERT_OK_AND_ASSIGN(Datum aggregated_and_grouped,
+                           GroupByTest(
+                               {
+                                   table->GetColumnByName("floats"),
+                                   table->GetColumnByName("nulls"),
+                                   table->GetColumnByName("booleans"),
+                                   table->GetColumnByName("decimal128"),
+                                   table->GetColumnByName("decimal256"),
+                                   table->GetColumnByName("fixed_binary"),
+                               },
+                               {table->GetColumnByName("key")},
+                               {
+                                   {"hash_one", nullptr},
+                                   {"hash_one", nullptr},
+                                   {"hash_one", nullptr},
+                                   {"hash_one", nullptr},
+                                   {"hash_one", nullptr},
+                                   {"hash_one", nullptr},
+                               },
+                               use_threads, use_exec_plan));
+      ValidateOutput(aggregated_and_grouped);
+      SortBy({"key_0"}, &aggregated_and_grouped);
+
+      const auto& struct_arr = aggregated_and_grouped.array_as<StructArray>();
+      //  Check the key column
+      AssertDatumsEqual(ArrayFromJSON(int64(), R"([1, 2, 3, null])"),
+                        struct_arr->field(struct_arr->num_fields() - 1));
+
+      //  Check values individually
+      auto col_0_type = float64();
+      const auto& col_0 = struct_arr->field(0);
+      EXPECT_THAT(col_0->GetScalar(0),
+                  ResultWith(AnyOfJSON(col_0_type, R"([1.0, 3.25])")));
+      EXPECT_THAT(col_0->GetScalar(1),
+                  ResultWith(AnyOfJSON(col_0_type, R"([0.0, 0.125, -0.25])")));
+      EXPECT_THAT(col_0->GetScalar(2), ResultWith(AnyOfJSON(col_0_type, R"([null])")));
+      EXPECT_THAT(col_0->GetScalar(3),
+                  ResultWith(AnyOfJSON(col_0_type, R"([4.0, 0.75])")));
+
+      auto col_1_type = null();
+      const auto& col_1 = struct_arr->field(1);
+      EXPECT_THAT(col_1->GetScalar(0), ResultWith(AnyOfJSON(col_1_type, R"([null])")));
+      EXPECT_THAT(col_1->GetScalar(1), ResultWith(AnyOfJSON(col_1_type, R"([null])")));
+      EXPECT_THAT(col_1->GetScalar(2), ResultWith(AnyOfJSON(col_1_type, R"([null])")));
+      EXPECT_THAT(col_1->GetScalar(3), ResultWith(AnyOfJSON(col_1_type, R"([null])")));
+
+      auto col_2_type = boolean();
+      const auto& col_2 = struct_arr->field(2);
+      EXPECT_THAT(col_2->GetScalar(0), ResultWith(AnyOfJSON(col_2_type, R"([true])")));
+      EXPECT_THAT(col_2->GetScalar(1), ResultWith(AnyOfJSON(col_2_type, R"([false])")));
+      EXPECT_THAT(col_2->GetScalar(2),
+                  ResultWith(AnyOfJSON(col_2_type, R"([true, false])")));
+      EXPECT_THAT(col_2->GetScalar(3),
+                  ResultWith(AnyOfJSON(col_2_type, R"([true, null])")));
+
+      auto col_3_type = decimal128(3, 2);
+      const auto& col_3 = struct_arr->field(3);
+      EXPECT_THAT(col_3->GetScalar(0),
+                  ResultWith(AnyOfJSON(col_3_type, R"(["1.01", "3.25"])")));
+      EXPECT_THAT(col_3->GetScalar(1),
+                  ResultWith(AnyOfJSON(col_3_type, R"(["0.00", "0.12", "-0.25"])")));
+      EXPECT_THAT(col_3->GetScalar(2), ResultWith(AnyOfJSON(col_3_type, R"([null])")));
+      EXPECT_THAT(col_3->GetScalar(3),
+                  ResultWith(AnyOfJSON(col_3_type, R"(["4.01", "0.75"])")));
+
+      auto col_4_type = decimal256(3, 2);
+      const auto& col_4 = struct_arr->field(4);
+      EXPECT_THAT(col_4->GetScalar(0),
+                  ResultWith(AnyOfJSON(col_4_type, R"(["1.01", "3.25"])")));
+      EXPECT_THAT(col_4->GetScalar(1),
+                  ResultWith(AnyOfJSON(col_4_type, R"(["0.00", "0.12", "-0.25"])")));
+      EXPECT_THAT(col_4->GetScalar(2), ResultWith(AnyOfJSON(col_4_type, R"([null])")));
+      EXPECT_THAT(col_4->GetScalar(3),
+                  ResultWith(AnyOfJSON(col_4_type, R"(["4.01", "0.75"])")));
+
+      auto col_5_type = fixed_size_binary(3);
+      const auto& col_5 = struct_arr->field(5);
+      EXPECT_THAT(col_5->GetScalar(0),
+                  ResultWith(AnyOfJSON(col_5_type, R"(["aaa", "ddd"])")));
+      EXPECT_THAT(col_5->GetScalar(1),
+                  ResultWith(AnyOfJSON(col_5_type, R"(["bab", "bcd", "bac"])")));
+      EXPECT_THAT(col_5->GetScalar(2), ResultWith(AnyOfJSON(col_5_type, R"([null])")));
+      EXPECT_THAT(col_5->GetScalar(3),
+                  ResultWith(AnyOfJSON(col_5_type, R"(["123", "234"])")));
+    }
+  }
+}
+
+TEST(GroupBy, OneNumericTypes) {
+  std::vector<std::shared_ptr<DataType>> types;
+  types.insert(types.end(), NumericTypes().begin(), NumericTypes().end());
+  types.insert(types.end(), TemporalTypes().begin(), TemporalTypes().end());
+  types.push_back(month_interval());
+
+  const std::vector<std::string> numeric_table_json = {R"([
+      [null, 1],
+      [1,    1]
+    ])",
+                                                       R"([
+      [0,    2],
+      [null, 3],
+      [3,    4],
+      [5,    4],
+      [4,    null],
+      [3,    1],
+      [0,    2]
+    ])",
+                                                       R"([
+      [0,    2],
+      [1,    null],
+      [null, 3]
+    ])"};
+
+  const std::vector<std::string> temporal_table_json = {R"([
+      [null,      1],
+      [86400000,  1]
+    ])",
+                                                        R"([
+      [0,         2],
+      [null,      3],
+      [259200000, 4],
+      [432000000, 4],
+      [345600000, null],
+      [259200000, 1],
+      [0,         2]
+    ])",
+                                                        R"([
+      [0,         2],
+      [86400000,  null],
+      [null,      3]
+    ])"};
+
+  for (const auto& type : types) {
+    for (bool use_exec_plan : {true, false}) {
+      for (bool use_threads : {true, false}) {
+        SCOPED_TRACE(type->ToString());
+        auto in_schema = schema({field("argument0", type), field("key", int64())});
+        auto table =
+            TableFromJSON(in_schema, (type->name() == "date64") ? temporal_table_json
+                                                                : numeric_table_json);
+        ASSERT_OK_AND_ASSIGN(
+            Datum aggregated_and_grouped,
+            GroupByTest({table->GetColumnByName("argument0")},
+                        {table->GetColumnByName("key")}, {{"hash_one", nullptr}},
+                        use_threads, use_exec_plan));
+        ValidateOutput(aggregated_and_grouped);
+        SortBy({"key_0"}, &aggregated_and_grouped);
+
+        const auto& struct_arr = aggregated_and_grouped.array_as<StructArray>();
+        //  Check the key column
+        AssertDatumsEqual(ArrayFromJSON(int64(), R"([1, 2, 3, 4, null])"),
+                          struct_arr->field(struct_arr->num_fields() - 1));
+
+        //  Check values individually
+        const auto& col = struct_arr->field(0);
+        if (type->name() == "date64") {
+          EXPECT_THAT(col->GetScalar(0),
+                      ResultWith(AnyOfJSON(type, R"([86400000, 259200000])")));
+          EXPECT_THAT(col->GetScalar(1), ResultWith(AnyOfJSON(type, R"([0])")));
+          EXPECT_THAT(col->GetScalar(2), ResultWith(AnyOfJSON(type, R"([null])")));
+          EXPECT_THAT(col->GetScalar(3),
+                      ResultWith(AnyOfJSON(type, R"([259200000, 432000000])")));
+          EXPECT_THAT(col->GetScalar(4),
+                      ResultWith(AnyOfJSON(type, R"([345600000, 86400000])")));
+        } else {
+          EXPECT_THAT(col->GetScalar(0), ResultWith(AnyOfJSON(type, R"([1, 3])")));
+          EXPECT_THAT(col->GetScalar(1), ResultWith(AnyOfJSON(type, R"([0])")));
+          EXPECT_THAT(col->GetScalar(2), ResultWith(AnyOfJSON(type, R"([null])")));
+          EXPECT_THAT(col->GetScalar(3), ResultWith(AnyOfJSON(type, R"([3, 5])")));
+          EXPECT_THAT(col->GetScalar(4), ResultWith(AnyOfJSON(type, R"([4, 1])")));
+        }
+      }
+    }
+  }
+}
+
+TEST(GroupBy, OneBinaryTypes) {
+  for (bool use_exec_plan : {true, false}) {
+    for (bool use_threads : {true, false}) {
+      for (const auto& type : BaseBinaryTypes()) {
+        SCOPED_TRACE(use_threads ? "parallel/merged" : "serial");
+
+        auto table = TableFromJSON(schema({
+                                       field("argument0", type),
+                                       field("key", int64()),
+                                   }),
+                                   {R"([
+    [null,   1],
+    ["aaaa", 1]
+])",
+                                    R"([
+    ["babcd",2],
+    [null,   3],
+    ["2",    null],
+    ["d",    1],
+    ["bc",   2]
+])",
+                                    R"([
+    ["bcd", 2],
+    ["123", null],
+    [null,  3]
+])"});
+
+        ASSERT_OK_AND_ASSIGN(
+            Datum aggregated_and_grouped,
+            GroupByTest({table->GetColumnByName("argument0")},
+                        {table->GetColumnByName("key")}, {{"hash_one", nullptr}},
+                        use_threads, use_exec_plan));
+        ValidateOutput(aggregated_and_grouped);
+        SortBy({"key_0"}, &aggregated_and_grouped);
+
+        const auto& struct_arr = aggregated_and_grouped.array_as<StructArray>();
+        //  Check the key column
+        AssertDatumsEqual(ArrayFromJSON(int64(), R"([1, 2, 3, null])"),
+                          struct_arr->field(struct_arr->num_fields() - 1));
+
+        const auto& col = struct_arr->field(0);
+        EXPECT_THAT(col->GetScalar(0), ResultWith(AnyOfJSON(type, R"(["aaaa", "d"])")));
+        EXPECT_THAT(col->GetScalar(1),
+                    ResultWith(AnyOfJSON(type, R"(["bcd", "bc", "babcd"])")));
+        EXPECT_THAT(col->GetScalar(2), ResultWith(AnyOfJSON(type, R"([null])")));
+        EXPECT_THAT(col->GetScalar(3), ResultWith(AnyOfJSON(type, R"(["2", "123"])")));
+      }
+    }
+  }
+}
+
+TEST(GroupBy, OneScalar) {

Review comment:
       For this input
   ```cpp
   input.batches = {
         ExecBatchFromJSON({ValueDescr::Scalar(int32()), int64()},
                           R"([[-1, 1], [-1, 1], [7, 1], [23, 1]])"),
         ExecBatchFromJSON({ValueDescr::Scalar(int32()), int64()},
                           R"([[-9, 1], [null, 1], [null, 2], [null, 3]])"),
         ExecBatchFromJSON({int32(), int64()}, R"([[22, 1], [3, 2], [4, 3]])"),
     };
     input.schema = schema({field("argument", int32()), field("key", int64())});
   ```
   this:
   ```cpp
   const auto& struct_arr = actual.array_as<StructArray>();
   ARROW_LOG(WARNING) << struct_arr->ToString();
   ```
   has the following output:
   ```
   -- is_valid: all not null
   -- child 0 type: int32
     [
       -1,
       -9,
       -9
     ]
   -- child 1 type: int64
    [
       1,
       2,
       3
    ]
   ```
   
   And I can't understand what's going on. Also, as in the `CountScalar` example above, `ValueDescr::Scalar()` is not used in the third call to `ExecBatchFromJSON()`; what is the reasoning behind that?
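   
   (For context, a rough sketch of my current reading, which may be wrong: a scalar-shaped `ExecBatch` column carries a single `Scalar` datum that is logically broadcast to every row, rather than one value per row. `key_array` below is just a stand-in for an int64 array:)
   ```cpp
   // Illustrative only: column 0 is one Int32 scalar broadcast across all
   // 4 rows; column 1 is an ordinary per-row array.
   ExecBatch batch({Datum(std::make_shared<Int32Scalar>(-1)), Datum(key_array)},
                   /*length=*/4);
   ```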
   







[GitHub] [arrow] lidavidm commented on a change in pull request #12368: ARROW-13993: [C++] [Compute] Add hash_one aggregate function

Posted by GitBox <gi...@apache.org>.
lidavidm commented on a change in pull request #12368:
URL: https://github.com/apache/arrow/pull/12368#discussion_r804679025



##########
File path: cpp/src/arrow/compute/kernels/hash_aggregate_test.cc
##########
@@ -2460,6 +2460,476 @@ TEST(GroupBy, Distinct) {
   }
 }
 
+TEST(GroupBy, One) {
+  {
+    auto table =
+        TableFromJSON(schema({field("argument", int64()), field("key", int64())}), {R"([
+    [99,  1],
+    [99,  1]
+])",
+                                                                                    R"([
+    [77,  2],
+    [null,   3],
+    [null,   3]
+])",
+                                                                                    R"([
+    [null,   4],
+    [null,   4]
+])",
+                                                                                  R"([
+    [88,  null],
+    [99,  3]
+])",
+                                                                                  R"([
+    [77,  2],
+    [76, 2]
+])",
+                                                                                  R"([
+    [75, null],
+    [74,  3]
+  ])",
+                                                                                  R"([
+    [73,    null],
+    [72,    null]
+  ])"});
+
+  ASSERT_OK_AND_ASSIGN(auto aggregated_and_grouped,
+                       internal::GroupBy(
+                           {
+                               table->GetColumnByName("argument"),
+                           },
+                           {
+                               table->GetColumnByName("key"),
+                           },
+                           {
+                               {"hash_one", nullptr},
+                           },
+                           false));
+  ValidateOutput(aggregated_and_grouped);
+  SortBy({"key_0"}, &aggregated_and_grouped);
+
+  AssertDatumsEqual(ArrayFromJSON(struct_({
+                                      field("hash_one", int64()),
+                                      field("key_0", int64()),
+                                  }),
+                                  R"([
+      [99, 1],
+      [77, 2],
+      [null,  3],
+      [null,  4],
+      [88, null]
+    ])"),
+                    aggregated_and_grouped,
+                    /*verbose=*/true);
+  }
+  {
+    auto table =
+        TableFromJSON(schema({field("argument", utf8()), field("key", int64())}), {R"([
+     ["foo",  1],
+     ["foo",  1]
+ ])",
+                                                                                   R"([
+     ["bar",  2],
+     [null,   3],
+     [null,   3]
+ ])",
+                                                                                   R"([
+     [null,   4],
+     [null,   4]
+ ])",
+                                                                                   R"([
+     ["baz",  null],
+     ["foo",  3]
+ ])",
+                                                                                   R"([
+     ["bar",  2],
+     ["spam", 2]
+ ])",
+                                                                                   R"([
+     ["eggs", null],
+     ["ham",  3]
+   ])",
+                                                                                   R"([
+     ["a",    null],
+     ["b",    null]
+   ])"});
+
+    ASSERT_OK_AND_ASSIGN(auto aggregated_and_grouped,
+                         internal::GroupBy(
+                             {
+                                 table->GetColumnByName("argument"),
+                             },
+                             {
+                                 table->GetColumnByName("key"),
+                             },
+                             {
+                                 {"hash_one", nullptr},
+                             },
+                             false));
+    ValidateOutput(aggregated_and_grouped);
+    SortBy({"key_0"}, &aggregated_and_grouped);
+
+    AssertDatumsEqual(ArrayFromJSON(struct_({
+                                        field("hash_one", utf8()),
+                                        field("key_0", int64()),
+                                    }),
+                                    R"([
+       ["foo", 1],
+       ["bar", 2],
+       [null,  3],
+       [null,  4],
+       ["baz", null]
+     ])"),
+                      aggregated_and_grouped,
+                      /*verbose=*/true);
+  }
+}
+
+TEST(GroupBy, OneOnly) {
+  auto in_schema = schema({
+      field("argument0", float64()),
+      field("argument1", null()),
+      field("argument2", boolean()),
+      field("key", int64()),
+  });
+  for (bool use_exec_plan : {false, true}) {
+    for (bool use_threads : {false}) {
+      SCOPED_TRACE(use_threads ? "parallel/merged" : "serial");
+
+      auto table = TableFromJSON(in_schema, {R"([
+    [1.0,   null, true, 1],
+    [null,  null, true, 1]
+])",
+                                             R"([
+    [0.0,   null, false, 2],
+    [null,  null, false, 3],
+    [4.0,   null, null,  null],
+    [3.25,  null, true,  1],
+    [0.125, null, false, 2]
+])",
+                                             R"([
+    [-0.25, null, false, 2],
+    [0.75,  null, true,  null],
+    [null,  null, true,  3]
+])"});
+
+      ASSERT_OK_AND_ASSIGN(Datum aggregated_and_grouped,
+                           GroupByTest(
+                               {
+                                   table->GetColumnByName("argument0"),
+                                   table->GetColumnByName("argument1"),
+                                   table->GetColumnByName("argument2"),
+                               },
+                               {table->GetColumnByName("key")},
+                               {
+                                   {"hash_one", nullptr},
+                                   {"hash_one", nullptr},
+                                   {"hash_one", nullptr},
+                               },
+                               use_threads, use_exec_plan));
+      ValidateOutput(aggregated_and_grouped);
+      SortBy({"key_0"}, &aggregated_and_grouped);
+
+      AssertDatumsEqual(ArrayFromJSON(struct_({
+                                          field("hash_one", float64()),
+                                          field("hash_one", null()),
+                                          field("hash_one", boolean()),
+                                          field("key_0", int64()),
+                                      }),
+                                      R"([
+    [1.0,  null, true,  1],
+    [0.0,  null, false, 2],
+    [null, null, false, 3],
+    [4.0,  null, null,  null]
+  ])"),
+                        aggregated_and_grouped,
+                        /*verbose=*/true);
+    }
+  }
+}
+
+TEST(GroupBy, OneTypes) {
+  std::vector<std::shared_ptr<DataType>> types;
+  types.insert(types.end(), NumericTypes().begin(), NumericTypes().end());
+  types.insert(types.end(), TemporalTypes().begin(), TemporalTypes().end());
+  types.push_back(month_interval());
+
+  const std::vector<std::string> default_table = {R"([
+    [1,    1],
+    [null, 1]
+])",
+                                                  R"([
+    [0,    2],
+    [null, 3],
+    [3,    4],
+    [5,    4],
+    [4,    null],
+    [3,    1],
+    [0,    2]
+])",
+                                                  R"([
+    [0,    2],
+    [1,    null],
+    [null, 3]
+])"};
+
+  const std::vector<std::string> date64_table = {R"([
+    [86400000,  1],
+    [null,      1]
+])",
+                                                 R"([
+    [0,         2],
+    [null,      3],
+    [259200000, 4],
+    [432000000, 4],
+    [345600000, null],
+    [259200000, 1],
+    [0,         2]
+])",
+                                                 R"([
+    [0,         2],
+    [86400000,  null],
+    [null,      3]
+])"};
+
+  const std::string default_expected =
+      R"([
+    [1,    1],
+    [0,    2],
+    [null, 3],
+    [3,    4],
+    [4,    null]
+    ])";
+
+  const std::string date64_expected =
+      R"([
+    [86400000,  1],
+    [0,         2],
+    [null,      3],
+    [259200000, 4],
+    [345600000, null]
+    ])";
+
+  for (const auto& ty : types) {
+    SCOPED_TRACE(ty->ToString());
+    auto in_schema = schema({field("argument0", ty), field("key", int64())});
+    auto table =
+        TableFromJSON(in_schema, (ty->name() == "date64") ? date64_table : default_table);
+
+    ASSERT_OK_AND_ASSIGN(
+        Datum aggregated_and_grouped,
+        GroupByTest({table->GetColumnByName("argument0")},
+                    {table->GetColumnByName("key")}, {{"hash_one", nullptr}},
+                    /*use_threads=*/false, /*use_exec_plan=*/true));
+    ValidateOutput(aggregated_and_grouped);
+    SortBy({"key_0"}, &aggregated_and_grouped);
+
+    AssertDatumsEqual(
+        ArrayFromJSON(struct_({
+                          field("hash_one", ty),
+                          field("key_0", int64()),
+                      }),
+                      (ty->name() == "date64") ? date64_expected : default_expected),
+        aggregated_and_grouped,
+        /*verbose=*/true);
+  }
+}
+
+TEST(GroupBy, OneDecimal) {
+  auto in_schema = schema({
+      field("argument0", decimal128(3, 2)),
+      field("argument1", decimal256(3, 2)),
+      field("key", int64()),
+  });
+  for (bool use_exec_plan : {false, true}) {
+    for (bool use_threads : {/*true, */ false}) {

Review comment:
       Or you can try to make AnyOf work with EXPECT_THAT: https://github.com/google/googletest/blob/main/docs/reference/matchers.md#composite-matchers
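   
   (A rough sketch of that approach, assuming gmock's `AnyOf`/`Truly` plus Arrow's `ResultWith` and `ScalarFromJSON` test helpers; plain `AnyOf` over `shared_ptr<Scalar>` would compare pointers, hence the `Truly` wrapper:)
   ```cpp
   // Matches a scalar value-equal to `expected` (pointer equality would be
   // meaningless here).
   auto scalar_equals = [](std::shared_ptr<Scalar> expected) {
     return testing::Truly([expected](const std::shared_ptr<Scalar>& actual) {
       return actual->Equals(*expected);
     });
   };
   // Either value is acceptable for this group, depending on consume order.
   EXPECT_THAT(col->GetScalar(0),
               ResultWith(testing::AnyOf(
                   scalar_equals(ScalarFromJSON(decimal128(3, 2), R"("1.01")")),
                   scalar_equals(ScalarFromJSON(decimal128(3, 2), R"("3.25")")))));
   ```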







[GitHub] [arrow] dhruv9vats commented on a change in pull request #12368: ARROW-13993: [C++] [Compute] Add hash_one aggregate function

Posted by GitBox <gi...@apache.org>.
dhruv9vats commented on a change in pull request #12368:
URL: https://github.com/apache/arrow/pull/12368#discussion_r808206494



##########
File path: cpp/src/arrow/testing/matchers.h
##########
@@ -61,6 +61,67 @@ class PointeesEqualMatcher {
 // Useful in conjunction with other googletest matchers.
 inline PointeesEqualMatcher PointeesEqual() { return {}; }
 
+class AnyOfJSONMatcher : public ::testing::internal::MatcherBaseImpl<AnyOfJSONMatcher> {
+ public:
+  using AnyOfJSONMatcher::MatcherBaseImpl::MatcherBaseImpl;
+  AnyOfJSONMatcher(std::shared_ptr<DataType> type, std::string array_json)
+      : type_(std::move(type)), array_json_(std::move(array_json)) {}
+
+  template <typename arg_type>
+  operator testing::Matcher<arg_type>() const {  // NOLINT runtime/explicit
+    class Impl : public ::testing::MatcherInterface<const arg_type&> {
+     public:
+      explicit Impl(std::shared_ptr<DataType> type, std::string array_json)
+          : type_(std::move(type)), array_json_(std::move(array_json)) {
+        array = ArrayFromJSON(type_, array_json_);
+      }
+      void DescribeTo(std::ostream* os) const override {
+        *os << "matches at least one scalar from ";
+        *os << array->ToString();
+      }
+      void DescribeNegationTo(::std::ostream* os) const override {
+        *os << "matches no scalar from ";
+        *os << array->ToString();
+      }
+      bool MatchAndExplain(
+          const arg_type& arg,
+          ::testing::MatchResultListener* result_listener) const override {
+        for (int64_t i = 0; i < array->length(); ++i) {
+          std::shared_ptr<Scalar> scalar;
+          auto maybe_scalar = array->GetScalar(i);
+          if (maybe_scalar.ok()) {
+            scalar = maybe_scalar.ValueOrDie();
+          } else {
+            *result_listener << "GetScalar() had status "
+                             << maybe_scalar.status().ToString() << " at index " << i
+                             << " in the input JSON Array";
+            return false;
+          }
+
+          if (scalar->Equals(arg)) return true;
+        }
+        *result_listener << "Argument scalar: '" << arg->ToString()
+                         << "' matches no scalar from " << array->ToString();
+        return false;
+      }
+      const std::shared_ptr<DataType> type_;
+      const std::string array_json_;
+      std::shared_ptr<Array> array;
+    };
+
+    return testing::Matcher<arg_type>(new Impl(type_, array_json_));
+  }
+
+ private:
+  const std::shared_ptr<DataType> type_;
+  const std::string array_json_;
+};
+
+inline AnyOfJSONMatcher AnyOfJSON(std::shared_ptr<DataType> type,

Review comment:
       Had to learn the hard way that `inline` is important. 😅
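   
   (A minimal illustration of the pitfall, with made-up names: a non-`inline` function defined in a header and included from more than one translation unit yields duplicate external-linkage definitions and a link error.)
   ```cpp
   // my_matchers.h (hypothetical) -- included from two different .cc files.
   #pragma once
   
   // Without `inline`, each including .cc file emits its own external-linkage
   // definition of MakeAnyOfJSON(), and the link fails with a duplicate-symbol
   // error; `inline` allows the repeated identical definitions.
   inline int MakeAnyOfJSON() { return 42; }
   ```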







[GitHub] [arrow] dhruv9vats commented on a change in pull request #12368: ARROW-13993: [C++] [Compute] Add hash_one aggregate function

Posted by GitBox <gi...@apache.org>.
dhruv9vats commented on a change in pull request #12368:
URL: https://github.com/apache/arrow/pull/12368#discussion_r805961512



##########
File path: cpp/src/arrow/compute/kernels/hash_aggregate_test.cc
##########
@@ -2460,6 +2461,558 @@ TEST(GroupBy, Distinct) {
   }
 }
 
+MATCHER_P(AnyOfScalar, arrow_array, "") {
+  for (int64_t i = 0; i < arrow_array->length(); ++i) {
+    auto scalar = arrow_array->GetScalar(i).ValueOrDie();
+    if (scalar->Equals(arg)) return true;
+  }
+  *result_listener << "Argument scalar: '" << arg->ToString()
+                   << "' matches no input scalar.";
+  return false;
+}
+
+MATCHER_P(AnyOfScalarFromUniques, unique_list, "") {
+  const auto& flatten = unique_list->Flatten().ValueOrDie();
+  const auto& offsets = std::dynamic_pointer_cast<Int32Array>(unique_list->offsets());
+
+  for (int64_t i = 0; i < arg->length(); ++i) {
+    bool match_found = false;
+    const auto group_hash_one = arg->GetScalar(i).ValueOrDie();
+    int64_t start = offsets->Value(i);
+    int64_t end = offsets->Value(i + 1);
+    for (int64_t j = start; j < end; ++j) {
+      auto s = flatten->GetScalar(j).ValueOrDie();
+      if (s->Equals(group_hash_one)) {
+        match_found = true;
+        break;
+      }
+    }
+    if (!match_found) {
+      *result_listener << "Argument scalar: '" << group_hash_one->ToString()
+                       << "' matches no input scalar.";
+      return false;
+    }
+  }
+  return true;
+}
+
+TEST(GroupBy, One) {
+  {
+    auto table =
+        TableFromJSON(schema({field("argument", int64()), field("key", int64())}), {R"([
+    [99,  1],
+    [99,  1]
+])",
+                                                                                    R"([
+    [77,  2],
+    [null,   3],
+    [null,   3]
+])",
+                                                                                    R"([
+    [null,   4],
+    [null,   4]
+])",
+                                                                                  R"([
+    [88,  null],
+    [99,  3]
+])",
+                                                                                  R"([
+    [77,  2],
+    [76, 2]
+])",
+                                                                                  R"([
+    [75, null],
+    [74,  3]
+  ])",
+                                                                                  R"([
+    [73,    null],
+    [72,    null]
+  ])"});
+
+  ASSERT_OK_AND_ASSIGN(auto aggregated_and_grouped,
+                       internal::GroupBy(
+                           {
+                               table->GetColumnByName("argument"),
+                           },
+                           {
+                               table->GetColumnByName("key"),
+                           },
+                           {
+                               {"hash_one", nullptr},
+                           },
+                           false));
+  ValidateOutput(aggregated_and_grouped);
+  SortBy({"key_0"}, &aggregated_and_grouped);
+
+  AssertDatumsEqual(ArrayFromJSON(struct_({
+                                      field("hash_one", int64()),
+                                      field("key_0", int64()),
+                                  }),
+                                  R"([
+      [99, 1],
+      [77, 2],
+      [null,  3],
+      [null,  4],
+      [88, null]
+    ])"),
+                    aggregated_and_grouped,
+                    /*verbose=*/true);
+  }
+  {
+    auto table =
+        TableFromJSON(schema({field("argument", utf8()), field("key", int64())}), {R"([
+     ["foo",  1],
+     ["foo",  1]
+ ])",
+                                                                                   R"([
+     ["bar",  2],
+     [null,   3],
+     [null,   3]
+ ])",
+                                                                                   R"([
+     [null,   4],
+     [null,   4]
+ ])",
+                                                                                   R"([
+     ["baz",  null],
+     ["foo",  3]
+ ])",
+                                                                                   R"([
+     ["bar",  2],
+     ["spam", 2]
+ ])",
+                                                                                   R"([
+     ["eggs", null],
+     ["ham",  3]
+   ])",
+                                                                                   R"([
+     ["a",    null],
+     ["b",    null]
+   ])"});
+
+    ASSERT_OK_AND_ASSIGN(auto aggregated_and_grouped,
+                         internal::GroupBy(
+                             {
+                                 table->GetColumnByName("argument"),
+                             },
+                             {
+                                 table->GetColumnByName("key"),
+                             },
+                             {
+                                 {"hash_one", nullptr},
+                             },
+                             false));
+    ValidateOutput(aggregated_and_grouped);
+    SortBy({"key_0"}, &aggregated_and_grouped);
+
+    AssertDatumsEqual(ArrayFromJSON(struct_({
+                                        field("hash_one", utf8()),
+                                        field("key_0", int64()),
+                                    }),
+                                    R"([
+       ["foo", 1],
+       ["bar", 2],
+       [null,  3],
+       [null,  4],
+       ["baz", null]
+     ])"),
+                      aggregated_and_grouped,
+                      /*verbose=*/true);
+  }
+}
+
+TEST(GroupBy, OneOnly) {
+  auto in_schema = schema({
+      field("argument0", float64()),
+      field("argument1", null()),
+      field("argument2", boolean()),
+      field("key", int64()),
+  });
+  for (bool use_exec_plan : {false, true}) {
+    for (bool use_threads : {false, true}) {
+      SCOPED_TRACE(use_threads ? "parallel/merged" : "serial");
+
+      auto table = TableFromJSON(in_schema, {R"([
+    [1.0,   null, true, 1],
+    [null,  null, true, 1]
+])",
+                                             R"([
+    [0.0,   null, false, 2],
+    [null,  null, false, 3],
+    [4.0,   null, null,  null],
+    [3.25,  null, true,  1],
+    [0.125, null, false, 2]
+])",
+                                             R"([
+    [-0.25, null, false, 2],
+    [0.75,  null, true,  null],
+    [null,  null, true,  3]
+])"});
+
+      ASSERT_OK_AND_ASSIGN(Datum aggregated_and_grouped,
+                           GroupByTest(
+                               {
+                                   table->GetColumnByName("argument0"),
+                                   table->GetColumnByName("argument1"),
+                                   table->GetColumnByName("argument2"),
+                               },
+                               {table->GetColumnByName("key")},
+                               {
+                                   {"hash_one", nullptr},
+                                   {"hash_one", nullptr},
+                                   {"hash_one", nullptr},
+                               },
+                               use_threads, use_exec_plan));
+      ValidateOutput(aggregated_and_grouped);
+      SortBy({"key_0"}, &aggregated_and_grouped);
+
+      //      AssertDatumsEqual(ArrayFromJSON(struct_({
+      //                                          field("hash_one", float64()),
+      //                                          field("hash_one", null()),
+      //                                          field("hash_one", boolean()),
+      //                                          field("key_0", int64()),
+      //                                      }),
+      //                                      R"([
+      //          [1.0,  null, true,  1],
+      //          [0.0,  null, false, 2],
+      //          [null, null, false, 3],
+      //          [4.0,  null, null,  null]
+      //        ])"),
+      //                        aggregated_and_grouped,
+      //                        /*verbose=*/true);
+
+      const auto& struct_arr = aggregated_and_grouped.array_as<StructArray>();
+      //  Check the key column
+      AssertDatumsEqual(ArrayFromJSON(int64(), "[1, 2, 3, null]"), struct_arr->field(3));
+
+      auto type_col_0 = float64();
+      auto group_one_col_0 =
+          AnyOfScalar(ArrayFromJSON(type_col_0, R"([1.0, null, 3.25])"));
+      auto group_two_col_0 =
+          AnyOfScalar(ArrayFromJSON(type_col_0, R"([0.0, 0.125, -0.25])"));
+      auto group_three_col_0 = AnyOfScalar(ArrayFromJSON(type_col_0, R"([null])"));
+      auto group_null_col_0 = AnyOfScalar(ArrayFromJSON(type_col_0, R"([4.0, 0.75])"));
+
+      //  Check values individually
+      const auto& col0 = struct_arr->field(0);
+      ASSERT_OK_AND_ASSIGN(const auto g_one, col0->GetScalar(0));
+      EXPECT_THAT(g_one, group_one_col_0);
+      ASSERT_OK_AND_ASSIGN(const auto g_two, col0->GetScalar(1));
+      EXPECT_THAT(g_two, group_two_col_0);
+      ASSERT_OK_AND_ASSIGN(const auto g_three, col0->GetScalar(2));
+      EXPECT_THAT(g_three, group_three_col_0);
+      ASSERT_OK_AND_ASSIGN(const auto g_null, col0->GetScalar(3));
+      EXPECT_THAT(g_null, group_null_col_0);

Review comment:
       This is what the `EXPECT_THAT` and `AnyOf` approach might look like for testing _just one_ column.

##########
File path: cpp/src/arrow/compute/kernels/hash_aggregate_test.cc
##########
@@ -2460,6 +2461,558 @@ TEST(GroupBy, Distinct) {
   }
 }
 
+MATCHER_P(AnyOfScalar, arrow_array, "") {
+  for (int64_t i = 0; i < arrow_array->length(); ++i) {
+    auto scalar = arrow_array->GetScalar(i).ValueOrDie();
+    if (scalar->Equals(arg)) return true;
+  }
+  *result_listener << "Argument scalar: '" << arg->ToString()
+                   << "' matches no input scalar.";
+  return false;
+}
+
+MATCHER_P(AnyOfScalarFromUniques, unique_list, "") {
+  const auto& flatten = unique_list->Flatten().ValueOrDie();
+  const auto& offsets = std::dynamic_pointer_cast<Int32Array>(unique_list->offsets());
+
+  for (int64_t i = 0; i < arg->length(); ++i) {
+    bool match_found = false;
+    const auto group_hash_one = arg->GetScalar(i).ValueOrDie();
+    int64_t start = offsets->Value(i);
+    int64_t end = offsets->Value(i + 1);
+    for (int64_t j = start; j < end; ++j) {
+      auto s = flatten->GetScalar(j).ValueOrDie();
+      if (s->Equals(group_hash_one)) {
+        match_found = true;
+        break;
+      }
+    }
+    if (!match_found) {
+      *result_listener << "Argument scalar: '" << group_hash_one->ToString()
+                       << "' matches no input scalar.";
+      return false;
+    }
+  }
+  return true;
+}
+
+TEST(GroupBy, One) {
+  {
+    auto table =
+        TableFromJSON(schema({field("argument", int64()), field("key", int64())}), {R"([
+    [99,  1],
+    [99,  1]
+])",
+                                                                                    R"([
+    [77,  2],
+    [null,   3],
+    [null,   3]
+])",
+                                                                                    R"([
+    [null,   4],
+    [null,   4]
+])",
+                                                                                  R"([
+    [88,  null],
+    [99,  3]
+])",
+                                                                                  R"([
+    [77,  2],
+    [76, 2]
+])",
+                                                                                  R"([
+    [75, null],
+    [74,  3]
+  ])",
+                                                                                  R"([
+    [73,    null],
+    [72,    null]
+  ])"});
+
+  ASSERT_OK_AND_ASSIGN(auto aggregated_and_grouped,
+                       internal::GroupBy(
+                           {
+                               table->GetColumnByName("argument"),
+                           },
+                           {
+                               table->GetColumnByName("key"),
+                           },
+                           {
+                               {"hash_one", nullptr},
+                           },
+                           false));
+  ValidateOutput(aggregated_and_grouped);
+  SortBy({"key_0"}, &aggregated_and_grouped);
+
+  AssertDatumsEqual(ArrayFromJSON(struct_({
+                                      field("hash_one", int64()),
+                                      field("key_0", int64()),
+                                  }),
+                                  R"([
+      [99, 1],
+      [77, 2],
+      [null,  3],
+      [null,  4],
+      [88, null]
+    ])"),
+                    aggregated_and_grouped,
+                    /*verbose=*/true);
+  }
+  {
+    auto table =
+        TableFromJSON(schema({field("argument", utf8()), field("key", int64())}), {R"([
+     ["foo",  1],
+     ["foo",  1]
+ ])",
+                                                                                   R"([
+     ["bar",  2],
+     [null,   3],
+     [null,   3]
+ ])",
+                                                                                   R"([
+     [null,   4],
+     [null,   4]
+ ])",
+                                                                                   R"([
+     ["baz",  null],
+     ["foo",  3]
+ ])",
+                                                                                   R"([
+     ["bar",  2],
+     ["spam", 2]
+ ])",
+                                                                                   R"([
+     ["eggs", null],
+     ["ham",  3]
+   ])",
+                                                                                   R"([
+     ["a",    null],
+     ["b",    null]
+   ])"});
+
+    ASSERT_OK_AND_ASSIGN(auto aggregated_and_grouped,
+                         internal::GroupBy(
+                             {
+                                 table->GetColumnByName("argument"),
+                             },
+                             {
+                                 table->GetColumnByName("key"),
+                             },
+                             {
+                                 {"hash_one", nullptr},
+                             },
+                             false));
+    ValidateOutput(aggregated_and_grouped);
+    SortBy({"key_0"}, &aggregated_and_grouped);
+
+    AssertDatumsEqual(ArrayFromJSON(struct_({
+                                        field("hash_one", utf8()),
+                                        field("key_0", int64()),
+                                    }),
+                                    R"([
+       ["foo", 1],
+       ["bar", 2],
+       [null,  3],
+       [null,  4],
+       ["baz", null]
+     ])"),
+                      aggregated_and_grouped,
+                      /*verbose=*/true);
+  }
+}
+
+TEST(GroupBy, OneOnly) {
+  auto in_schema = schema({
+      field("argument0", float64()),
+      field("argument1", null()),
+      field("argument2", boolean()),
+      field("key", int64()),
+  });
+  for (bool use_exec_plan : {false, true}) {
+    for (bool use_threads : {false, true}) {
+      SCOPED_TRACE(use_threads ? "parallel/merged" : "serial");
+
+      auto table = TableFromJSON(in_schema, {R"([
+    [1.0,   null, true, 1],
+    [null,  null, true, 1]
+])",
+                                             R"([
+    [0.0,   null, false, 2],
+    [null,  null, false, 3],
+    [4.0,   null, null,  null],
+    [3.25,  null, true,  1],
+    [0.125, null, false, 2]
+])",
+                                             R"([
+    [-0.25, null, false, 2],
+    [0.75,  null, true,  null],
+    [null,  null, true,  3]
+])"});
+
+      ASSERT_OK_AND_ASSIGN(Datum aggregated_and_grouped,
+                           GroupByTest(
+                               {
+                                   table->GetColumnByName("argument0"),
+                                   table->GetColumnByName("argument1"),
+                                   table->GetColumnByName("argument2"),
+                               },
+                               {table->GetColumnByName("key")},
+                               {
+                                   {"hash_one", nullptr},
+                                   {"hash_one", nullptr},
+                                   {"hash_one", nullptr},
+                               },
+                               use_threads, use_exec_plan));
+      ValidateOutput(aggregated_and_grouped);
+      SortBy({"key_0"}, &aggregated_and_grouped);
+
+      //      AssertDatumsEqual(ArrayFromJSON(struct_({
+      //                                          field("hash_one", float64()),
+      //                                          field("hash_one", null()),
+      //                                          field("hash_one", boolean()),
+      //                                          field("key_0", int64()),
+      //                                      }),
+      //                                      R"([
+      //          [1.0,  null, true,  1],
+      //          [0.0,  null, false, 2],
+      //          [null, null, false, 3],
+      //          [4.0,  null, null,  null]
+      //        ])"),
+      //                        aggregated_and_grouped,
+      //                        /*verbose=*/true);
+
+      const auto& struct_arr = aggregated_and_grouped.array_as<StructArray>();
+      //  Check the key column
+      AssertDatumsEqual(ArrayFromJSON(int64(), "[1, 2, 3, null]"), struct_arr->field(3));
+
+      auto type_col_0 = float64();
+      auto group_one_col_0 =
+          AnyOfScalar(ArrayFromJSON(type_col_0, R"([1.0, null, 3.25])"));
+      auto group_two_col_0 =
+          AnyOfScalar(ArrayFromJSON(type_col_0, R"([0.0, 0.125, -0.25])"));
+      auto group_three_col_0 = AnyOfScalar(ArrayFromJSON(type_col_0, R"([null])"));
+      auto group_null_col_0 = AnyOfScalar(ArrayFromJSON(type_col_0, R"([4.0, 0.75])"));
+
+      //  Check values individually
+      const auto& col0 = struct_arr->field(0);
+      ASSERT_OK_AND_ASSIGN(const auto g_one, col0->GetScalar(0));
+      EXPECT_THAT(g_one, group_one_col_0);
+      ASSERT_OK_AND_ASSIGN(const auto g_two, col0->GetScalar(1));
+      EXPECT_THAT(g_two, group_two_col_0);
+      ASSERT_OK_AND_ASSIGN(const auto g_three, col0->GetScalar(2));
+      EXPECT_THAT(g_three, group_three_col_0);
+      ASSERT_OK_AND_ASSIGN(const auto g_null, col0->GetScalar(3));
+      EXPECT_THAT(g_null, group_null_col_0);
+
+      CountOptions all(CountOptions::ALL);
+      ASSERT_OK_AND_ASSIGN(
+          auto distinct_out,
+          internal::GroupBy(
+              {
+                  table->GetColumnByName("argument0"),
+                  table->GetColumnByName("argument1"),
+                  table->GetColumnByName("argument2"),
+              },
+              {
+                  table->GetColumnByName("key"),
+              },
+              {{"hash_distinct", &all}, {"hash_distinct", &all}, {"hash_distinct", &all}},
+              use_threads));
+      ValidateOutput(distinct_out);
+      SortBy({"key_0"}, &distinct_out);
+
+      const auto& struct_arr_distinct = distinct_out.array_as<StructArray>();
+      for (int64_t col = 0; col < struct_arr_distinct->length() - 1; ++col) {
+        const auto matcher = AnyOfScalarFromUniques(
+            checked_pointer_cast<ListArray>(struct_arr_distinct->field(col)));
+        EXPECT_THAT(struct_arr->field(col), matcher);
+      }

Review comment:
       This, on the other hand, tests _all_ the columns (the key column is not a `ListArray`, so it would have to be checked manually, which is non-trivial). So if using other kernels to write tests is not strictly discouraged, this is a rather clean way of doing it. @lidavidm 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] lidavidm commented on a change in pull request #12368: ARROW-13993: [C++] [Compute] Add hash_one aggregate function

Posted by GitBox <gi...@apache.org>.
lidavidm commented on a change in pull request #12368:
URL: https://github.com/apache/arrow/pull/12368#discussion_r806838598



##########
File path: cpp/src/arrow/compute/kernels/hash_aggregate_test.cc
##########
@@ -2460,6 +2461,558 @@ TEST(GroupBy, Distinct) {
   }
 }
 
+MATCHER_P(AnyOfScalar, arrow_array, "") {

Review comment:
       IIRC, these convenience macros aren't always available in the CI environments we use. See https://github.com/apache/arrow/commit/cd30dea861d6dfd670032c655f329cb16bb99a7a
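
   A hand-rolled matcher would sidestep the macro entirely. A minimal sketch, assuming only the core gmock `MatcherInterface`/`MakeMatcher` APIs (not necessarily how we'd want to spell it):

   ```cpp
   #include <memory>
   #include <ostream>

   #include <gmock/gmock.h>

   #include "arrow/array.h"
   #include "arrow/scalar.h"

   class AnyOfScalarMatcher
       : public testing::MatcherInterface<const std::shared_ptr<arrow::Scalar>&> {
    public:
     explicit AnyOfScalarMatcher(std::shared_ptr<arrow::Array> array)
         : array_(std::move(array)) {}

     void DescribeTo(std::ostream* os) const override {
       *os << "equals any scalar in " << array_->ToString();
     }

     bool MatchAndExplain(const std::shared_ptr<arrow::Scalar>& arg,
                          testing::MatchResultListener* listener) const override {
       for (int64_t i = 0; i < array_->length(); ++i) {
         auto maybe_scalar = array_->GetScalar(i);
         // Skip (rather than crash on) slots where GetScalar fails.
         if (maybe_scalar.ok() && maybe_scalar.ValueOrDie()->Equals(*arg)) {
           return true;
         }
       }
       *listener << "scalar " << arg->ToString() << " matches no input scalar";
       return false;
     }

    private:
     std::shared_ptr<arrow::Array> array_;
   };

   testing::Matcher<const std::shared_ptr<arrow::Scalar>&> AnyOfScalar(
       std::shared_ptr<arrow::Array> array) {
     return testing::MakeMatcher(new AnyOfScalarMatcher(std::move(array)));
   }
   ```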

##########
File path: cpp/src/arrow/compute/kernels/hash_aggregate.cc
##########
@@ -2451,6 +2451,333 @@ Result<std::unique_ptr<KernelState>> GroupedDistinctInit(KernelContext* ctx,
   return std::move(impl);
 }
 
+// ----------------------------------------------------------------------
+// One implementation
+
+template <typename Type, typename Enable = void>
+struct GroupedOneImpl final : public GroupedAggregator {
+  using CType = typename TypeTraits<Type>::CType;
+  using GetSet = GroupedValueTraits<Type>;
+
+  Status Init(ExecContext* ctx, const std::vector<ValueDescr>&,
+              const FunctionOptions* options) override {
+    // out_type_ initialized by GroupedOneInit
+    ones_ = TypedBufferBuilder<CType>(ctx->memory_pool());
+    has_one_ = TypedBufferBuilder<bool>(ctx->memory_pool());
+    has_value_ = TypedBufferBuilder<bool>(ctx->memory_pool());
+    return Status::OK();
+  }
+
+  Status Resize(int64_t new_num_groups) override {
+    auto added_groups = new_num_groups - num_groups_;
+    num_groups_ = new_num_groups;
+    RETURN_NOT_OK(ones_.Append(added_groups, static_cast<CType>(0)));
+    RETURN_NOT_OK(has_one_.Append(added_groups, false));
+    RETURN_NOT_OK(has_value_.Append(added_groups, false));
+    return Status::OK();
+  }
+
+  Status Consume(const ExecBatch& batch) override {
+    auto raw_ones_ = ones_.mutable_data();
+
+    return VisitGroupedValues<Type>(
+        batch,
+        [&](uint32_t g, CType val) -> Status {
+          if (!bit_util::GetBit(has_one_.data(), g)) {
+            GetSet::Set(raw_ones_, g, val);
+            bit_util::SetBit(has_one_.mutable_data(), g);
+            bit_util::SetBit(has_value_.mutable_data(), g);
+          }
+          return Status::OK();
+        },
+        [&](uint32_t g) -> Status {
+          bit_util::SetBit(has_one_.mutable_data(), g);
+          return Status::OK();
+        });
+  }
+
+  Status Merge(GroupedAggregator&& raw_other,
+               const ArrayData& group_id_mapping) override {
+    auto other = checked_cast<GroupedOneImpl*>(&raw_other);
+
+    auto raw_ones = ones_.mutable_data();
+    auto other_raw_ones = other->ones_.mutable_data();
+
+    auto g = group_id_mapping.GetValues<uint32_t>(1);
+    for (uint32_t other_g = 0; static_cast<int64_t>(other_g) < group_id_mapping.length;
+         ++other_g, ++g) {
+      if (!bit_util::GetBit(has_one_.data(), *g)) {
+        if (bit_util::GetBit(other->has_value_.data(), other_g)) {
+          GetSet::Set(raw_ones, *g, GetSet::Get(other_raw_ones, other_g));
+          bit_util::SetBit(has_value_.mutable_data(), *g);
+        }
+        bit_util::SetBit(has_one_.mutable_data(), *g);
+      }
+    }
+
+    return Status::OK();
+  }
+
+  Result<Datum> Finalize() override {
+    ARROW_ASSIGN_OR_RAISE(auto null_bitmap, has_value_.Finish());
+    ARROW_ASSIGN_OR_RAISE(auto data, ones_.Finish());
+    return ArrayData::Make(out_type_, num_groups_,
+                           {std::move(null_bitmap), std::move(data)});
+  }
+
+  std::shared_ptr<DataType> out_type() const override { return out_type_; }
+
+  int64_t num_groups_;
+  TypedBufferBuilder<CType> ones_;
+  TypedBufferBuilder<bool> has_one_, has_value_;
+  std::shared_ptr<DataType> out_type_;
+};
+
+struct GroupedNullOneImpl : public GroupedAggregator {
+  Status Init(ExecContext* ctx, const std::vector<ValueDescr>&,
+              const FunctionOptions* options) override {
+    return Status::OK();
+  }
+
+  Status Resize(int64_t new_num_groups) override {
+    num_groups_ = new_num_groups;
+    return Status::OK();
+  }
+
+  Status Consume(const ExecBatch& batch) override { return Status::OK(); }
+
+  Status Merge(GroupedAggregator&& raw_other,
+               const ArrayData& group_id_mapping) override {
+    return Status::OK();
+  }
+
+  Result<Datum> Finalize() override {
+    return ArrayData::Make(null(), num_groups_, {nullptr}, num_groups_);
+  }
+
+  std::shared_ptr<DataType> out_type() const override { return null(); }
+
+  int64_t num_groups_;
+};
+
+template <typename Type>
+struct GroupedOneImpl<Type, enable_if_t<is_base_binary_type<Type>::value ||
+                                        std::is_same<Type, FixedSizeBinaryType>::value>>
+    final : public GroupedAggregator {
+  using Allocator = arrow::stl::allocator<char>;
+  using StringType = std::basic_string<char, std::char_traits<char>, Allocator>;
+
+  Status Init(ExecContext* ctx, const std::vector<ValueDescr>&,
+              const FunctionOptions* options) override {
+    ctx_ = ctx;
+    allocator_ = Allocator(ctx->memory_pool());
+    // out_type_ initialized by GroupedOneInit
+    has_value_ = TypedBufferBuilder<bool>(ctx->memory_pool());
+    return Status::OK();
+  }
+
+  Status Resize(int64_t new_num_groups) override {
+    auto added_groups = new_num_groups - num_groups_;
+    DCHECK_GE(added_groups, 0);
+    num_groups_ = new_num_groups;
+    ones_.resize(new_num_groups);
+    RETURN_NOT_OK(has_one_.Append(added_groups, false));
+    RETURN_NOT_OK(has_value_.Append(added_groups, false));
+    return Status::OK();
+  }
+
+  Status Consume(const ExecBatch& batch) override {
+    return VisitGroupedValues<Type>(
+        batch,
+        [&](uint32_t g, util::string_view val) -> Status {
+          if (!bit_util::GetBit(has_one_.data(), g)) {
+            ones_[g].emplace(val.data(), val.size(), allocator_);
+            bit_util::SetBit(has_one_.mutable_data(), g);
+            bit_util::SetBit(has_value_.mutable_data(), g);
+          }
+          return Status::OK();
+        },
+        [&](uint32_t g) -> Status {
+          // as has_one_ is set, has_value_ will never be set, resulting in null
+          bit_util::SetBit(has_one_.mutable_data(), g);
+          return Status::OK();
+        });
+  }
+
+  Status Merge(GroupedAggregator&& raw_other,
+               const ArrayData& group_id_mapping) override {
+    auto other = checked_cast<GroupedOneImpl*>(&raw_other);
+    auto g = group_id_mapping.GetValues<uint32_t>(1);
+    for (uint32_t other_g = 0; static_cast<int64_t>(other_g) < group_id_mapping.length;
+         ++other_g, ++g) {
+      if (!bit_util::GetBit(has_one_.data(), *g)) {
+        if (bit_util::GetBit(other->has_value_.data(), other_g)) {
+          ones_[*g] = std::move(other->ones_[other_g]);
+          bit_util::SetBit(has_value_.mutable_data(), *g);
+        }
+        bit_util::SetBit(has_one_.mutable_data(), *g);
+      }
+    }
+    return Status::OK();
+  }
+
+  Result<Datum> Finalize() override {
+    ARROW_ASSIGN_OR_RAISE(auto null_bitmap, has_value_.Finish());
+    auto ones =
+        ArrayData::Make(out_type(), num_groups_, {std::move(null_bitmap), nullptr});
+    RETURN_NOT_OK(MakeOffsetsValues(ones.get(), ones_));
+    return ones;
+  }
+
+  template <typename T = Type>
+  enable_if_base_binary<T, Status> MakeOffsetsValues(
+      ArrayData* array, const std::vector<util::optional<StringType>>& values) {
+    using offset_type = typename T::offset_type;
+    ARROW_ASSIGN_OR_RAISE(
+        auto raw_offsets,
+        AllocateBuffer((1 + values.size()) * sizeof(offset_type), ctx_->memory_pool()));
+    auto* offsets = reinterpret_cast<offset_type*>(raw_offsets->mutable_data());
+    offsets[0] = 0;
+    offsets++;
+    const uint8_t* null_bitmap = array->buffers[0]->data();
+    offset_type total_length = 0;
+    for (size_t i = 0; i < values.size(); i++) {
+      if (bit_util::GetBit(null_bitmap, i)) {
+        const util::optional<StringType>& value = values[i];
+        DCHECK(value.has_value());
+        if (value->size() >
+                static_cast<size_t>(std::numeric_limits<offset_type>::max()) ||
+            arrow::internal::AddWithOverflow(
+                total_length, static_cast<offset_type>(value->size()), &total_length)) {
+          return Status::Invalid("Result is too large to fit in ", *array->type,
+                                 " cast to large_ variant of type");
+        }
+      }
+      offsets[i] = total_length;
+    }
+    ARROW_ASSIGN_OR_RAISE(auto data, AllocateBuffer(total_length, ctx_->memory_pool()));
+    int64_t offset = 0;
+    for (size_t i = 0; i < values.size(); i++) {
+      if (bit_util::GetBit(null_bitmap, i)) {
+        const util::optional<StringType>& value = values[i];
+        DCHECK(value.has_value());
+        std::memcpy(data->mutable_data() + offset, value->data(), value->size());
+        offset += value->size();
+      }
+    }
+    array->buffers[1] = std::move(raw_offsets);
+    array->buffers.push_back(std::move(data));
+    return Status::OK();
+  }
+
+  template <typename T = Type>
+  enable_if_same<T, FixedSizeBinaryType, Status> MakeOffsetsValues(
+      ArrayData* array, const std::vector<util::optional<StringType>>& values) {
+    const uint8_t* null_bitmap = array->buffers[0]->data();
+    const int32_t slot_width =
+        checked_cast<const FixedSizeBinaryType&>(*array->type).byte_width();
+    int64_t total_length = values.size() * slot_width;
+    ARROW_ASSIGN_OR_RAISE(auto data, AllocateBuffer(total_length, ctx_->memory_pool()));
+    int64_t offset = 0;
+    for (size_t i = 0; i < values.size(); i++) {
+      if (bit_util::GetBit(null_bitmap, i)) {
+        const util::optional<StringType>& value = values[i];
+        DCHECK(value.has_value());
+        std::memcpy(data->mutable_data() + offset, value->data(), slot_width);
+      } else {
+        std::memset(data->mutable_data() + offset, 0x00, slot_width);
+      }
+      offset += slot_width;
+    }
+    array->buffers[1] = std::move(data);
+    return Status::OK();
+  }
+
+  std::shared_ptr<DataType> out_type() const override { return out_type_; }
+
+  ExecContext* ctx_;
+  Allocator allocator_;
+  int64_t num_groups_;
+  std::vector<util::optional<StringType>> ones_;
+  TypedBufferBuilder<bool> has_one_, has_value_;
+  std::shared_ptr<DataType> out_type_;
+};
+
+template <typename T>
+Result<std::unique_ptr<KernelState>> GroupedOneInit(KernelContext* ctx,
+                                                    const KernelInitArgs& args) {
+  ARROW_ASSIGN_OR_RAISE(auto impl, HashAggregateInit<GroupedOneImpl<T>>(ctx, args));
+  auto instance = static_cast<GroupedOneImpl<T>*>(impl.get());
+  instance->out_type_ = args.inputs[0].type;
+  return std::move(impl);
+}
+
+struct GroupedOneFactory {
+  template <typename T>
+  enable_if_physical_integer<T, Status> Visit(const T&) {
+    using PhysicalType = typename T::PhysicalType;
+    kernel = MakeKernel(std::move(argument_type), GroupedOneInit<PhysicalType>);
+    return Status::OK();
+  }
+
+  // MSVC2015 apparently doesn't compile this properly if we use

Review comment:
       We got rid of MSVC2015, so we can replace these two overloads with a single `enable_if_floating_point` one.
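
   A sketch of the consolidated overload, assuming `enable_if_floating_point` from type_traits.h (floating-point types need no physical-type remapping, so `GroupedOneInit<T>` is used directly):

   ```cpp
   template <typename T>
   enable_if_floating_point<T, Status> Visit(const T&) {
     kernel = MakeKernel(std::move(argument_type), GroupedOneInit<T>);
     return Status::OK();
   }
   ```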

##########
File path: cpp/src/arrow/compute/kernels/hash_aggregate_test.cc
##########
@@ -2460,6 +2461,558 @@ TEST(GroupBy, Distinct) {
   }
 }
 
+MATCHER_P(AnyOfScalar, arrow_array, "") {
+  for (int64_t i = 0; i < arrow_array->length(); ++i) {
+    auto scalar = arrow_array->GetScalar(i).ValueOrDie();

Review comment:
       We could handle the error instead and report an assertion failure if GetScalar fails.
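
   For instance, a sketch of the same matcher with the error surfaced through the result listener instead of aborting via `ValueOrDie()`:

   ```cpp
   MATCHER_P(AnyOfScalar, arrow_array, "") {
     for (int64_t i = 0; i < arrow_array->length(); ++i) {
       auto maybe_scalar = arrow_array->GetScalar(i);
       if (!maybe_scalar.ok()) {
         // Report the failure through the matcher instead of crashing.
         *result_listener << "GetScalar failed: "
                          << maybe_scalar.status().ToString();
         return false;
       }
       if (maybe_scalar.ValueOrDie()->Equals(arg)) return true;
     }
     *result_listener << "Argument scalar: '" << arg->ToString()
                      << "' matches no input scalar.";
     return false;
   }
   ```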

##########
File path: cpp/src/arrow/compute/kernels/hash_aggregate.cc
##########
@@ -2451,6 +2451,333 @@ Result<std::unique_ptr<KernelState>> GroupedDistinctInit(KernelContext* ctx,
   return std::move(impl);
 }
 
+// ----------------------------------------------------------------------
+// One implementation
+
+template <typename Type, typename Enable = void>
+struct GroupedOneImpl final : public GroupedAggregator {
+  using CType = typename TypeTraits<Type>::CType;
+  using GetSet = GroupedValueTraits<Type>;
+
+  Status Init(ExecContext* ctx, const std::vector<ValueDescr>&,
+              const FunctionOptions* options) override {
+    // out_type_ initialized by GroupedOneInit
+    ones_ = TypedBufferBuilder<CType>(ctx->memory_pool());
+    has_one_ = TypedBufferBuilder<bool>(ctx->memory_pool());
+    has_value_ = TypedBufferBuilder<bool>(ctx->memory_pool());
+    return Status::OK();
+  }
+
+  Status Resize(int64_t new_num_groups) override {
+    auto added_groups = new_num_groups - num_groups_;
+    num_groups_ = new_num_groups;
+    RETURN_NOT_OK(ones_.Append(added_groups, static_cast<CType>(0)));
+    RETURN_NOT_OK(has_one_.Append(added_groups, false));
+    RETURN_NOT_OK(has_value_.Append(added_groups, false));
+    return Status::OK();
+  }
+
+  Status Consume(const ExecBatch& batch) override {
+    auto raw_ones_ = ones_.mutable_data();
+
+    return VisitGroupedValues<Type>(
+        batch,
+        [&](uint32_t g, CType val) -> Status {
+          if (!bit_util::GetBit(has_one_.data(), g)) {
+            GetSet::Set(raw_ones_, g, val);
+            bit_util::SetBit(has_one_.mutable_data(), g);
+            bit_util::SetBit(has_value_.mutable_data(), g);
+          }
+          return Status::OK();
+        },
+        [&](uint32_t g) -> Status {
+          bit_util::SetBit(has_one_.mutable_data(), g);
+          return Status::OK();
+        });
+  }
+
+  Status Merge(GroupedAggregator&& raw_other,
+               const ArrayData& group_id_mapping) override {
+    auto other = checked_cast<GroupedOneImpl*>(&raw_other);
+
+    auto raw_ones = ones_.mutable_data();
+    auto other_raw_ones = other->ones_.mutable_data();
+
+    auto g = group_id_mapping.GetValues<uint32_t>(1);
+    for (uint32_t other_g = 0; static_cast<int64_t>(other_g) < group_id_mapping.length;
+         ++other_g, ++g) {
+      if (!bit_util::GetBit(has_one_.data(), *g)) {
+        if (bit_util::GetBit(other->has_value_.data(), other_g)) {
+          GetSet::Set(raw_ones, *g, GetSet::Get(other_raw_ones, other_g));
+          bit_util::SetBit(has_value_.mutable_data(), *g);
+        }
+        bit_util::SetBit(has_one_.mutable_data(), *g);
+      }
+    }
+
+    return Status::OK();
+  }
+
+  Result<Datum> Finalize() override {
+    ARROW_ASSIGN_OR_RAISE(auto null_bitmap, has_value_.Finish());
+    ARROW_ASSIGN_OR_RAISE(auto data, ones_.Finish());
+    return ArrayData::Make(out_type_, num_groups_,
+                           {std::move(null_bitmap), std::move(data)});
+  }
+
+  std::shared_ptr<DataType> out_type() const override { return out_type_; }
+
+  int64_t num_groups_;
+  TypedBufferBuilder<CType> ones_;
+  TypedBufferBuilder<bool> has_one_, has_value_;
+  std::shared_ptr<DataType> out_type_;
+};
+
+struct GroupedNullOneImpl : public GroupedAggregator {
+  Status Init(ExecContext* ctx, const std::vector<ValueDescr>&,
+              const FunctionOptions* options) override {
+    return Status::OK();
+  }
+
+  Status Resize(int64_t new_num_groups) override {
+    num_groups_ = new_num_groups;
+    return Status::OK();
+  }
+
+  Status Consume(const ExecBatch& batch) override { return Status::OK(); }
+
+  Status Merge(GroupedAggregator&& raw_other,
+               const ArrayData& group_id_mapping) override {
+    return Status::OK();
+  }
+
+  Result<Datum> Finalize() override {
+    return ArrayData::Make(null(), num_groups_, {nullptr}, num_groups_);
+  }
+
+  std::shared_ptr<DataType> out_type() const override { return null(); }
+
+  int64_t num_groups_;
+};
+
+template <typename Type>
+struct GroupedOneImpl<Type, enable_if_t<is_base_binary_type<Type>::value ||
+                                        std::is_same<Type, FixedSizeBinaryType>::value>>
+    final : public GroupedAggregator {
+  using Allocator = arrow::stl::allocator<char>;
+  using StringType = std::basic_string<char, std::char_traits<char>, Allocator>;
+
+  Status Init(ExecContext* ctx, const std::vector<ValueDescr>&,
+              const FunctionOptions* options) override {
+    ctx_ = ctx;
+    allocator_ = Allocator(ctx->memory_pool());
+    // out_type_ initialized by GroupedOneInit
+    has_value_ = TypedBufferBuilder<bool>(ctx->memory_pool());
+    return Status::OK();
+  }
+
+  Status Resize(int64_t new_num_groups) override {
+    auto added_groups = new_num_groups - num_groups_;
+    DCHECK_GE(added_groups, 0);
+    num_groups_ = new_num_groups;
+    ones_.resize(new_num_groups);
+    RETURN_NOT_OK(has_one_.Append(added_groups, false));
+    RETURN_NOT_OK(has_value_.Append(added_groups, false));
+    return Status::OK();
+  }
+
+  Status Consume(const ExecBatch& batch) override {
+    return VisitGroupedValues<Type>(
+        batch,
+        [&](uint32_t g, util::string_view val) -> Status {
+          if (!bit_util::GetBit(has_one_.data(), g)) {
+            ones_[g].emplace(val.data(), val.size(), allocator_);
+            bit_util::SetBit(has_one_.mutable_data(), g);
+            bit_util::SetBit(has_value_.mutable_data(), g);
+          }
+          return Status::OK();
+        },
+        [&](uint32_t g) -> Status {
+          // as has_one_ is set, has_value_ will never be set, resulting in null
+          bit_util::SetBit(has_one_.mutable_data(), g);
+          return Status::OK();
+        });
+  }
+
+  Status Merge(GroupedAggregator&& raw_other,
+               const ArrayData& group_id_mapping) override {
+    auto other = checked_cast<GroupedOneImpl*>(&raw_other);
+    auto g = group_id_mapping.GetValues<uint32_t>(1);
+    for (uint32_t other_g = 0; static_cast<int64_t>(other_g) < group_id_mapping.length;
+         ++other_g, ++g) {
+      if (!bit_util::GetBit(has_one_.data(), *g)) {
+        if (bit_util::GetBit(other->has_value_.data(), other_g)) {
+          ones_[*g] = std::move(other->ones_[other_g]);
+          bit_util::SetBit(has_value_.mutable_data(), *g);
+        }
+        bit_util::SetBit(has_one_.mutable_data(), *g);
+      }
+    }
+    return Status::OK();
+  }
+
+  Result<Datum> Finalize() override {
+    ARROW_ASSIGN_OR_RAISE(auto null_bitmap, has_value_.Finish());
+    auto ones =
+        ArrayData::Make(out_type(), num_groups_, {std::move(null_bitmap), nullptr});
+    RETURN_NOT_OK(MakeOffsetsValues(ones.get(), ones_));
+    return ones;
+  }
+
+  template <typename T = Type>
+  enable_if_base_binary<T, Status> MakeOffsetsValues(

Review comment:
       We could factor those out, yeah

##########
File path: cpp/src/arrow/compute/kernels/hash_aggregate_test.cc
##########
@@ -2460,6 +2461,558 @@ TEST(GroupBy, Distinct) {
   }
 }
 
+MATCHER_P(AnyOfScalar, arrow_array, "") {
+  for (int64_t i = 0; i < arrow_array->length(); ++i) {
+    auto scalar = arrow_array->GetScalar(i).ValueOrDie();
+    if (scalar->Equals(arg)) return true;
+  }
+  *result_listener << "Argument scalar: '" << arg->ToString()
+                   << "' matches no input scalar.";
+  return false;
+}
+
+MATCHER_P(AnyOfScalarFromUniques, unique_list, "") {
+  const auto& flatten = unique_list->Flatten().ValueOrDie();
+  const auto& offsets = std::dynamic_pointer_cast<Int32Array>(unique_list->offsets());
+
+  for (int64_t i = 0; i < arg->length(); ++i) {
+    bool match_found = false;
+    const auto group_hash_one = arg->GetScalar(i).ValueOrDie();
+    int64_t start = offsets->Value(i);
+    int64_t end = offsets->Value(i + 1);
+    for (int64_t j = start; j < end; ++j) {
+      auto s = flatten->GetScalar(j).ValueOrDie();
+      if (s->Equals(group_hash_one)) {
+        match_found = true;
+        break;
+      }
+    }
+    if (!match_found) {
+      *result_listener << "Argument scalar: '" << group_hash_one->ToString()
+                       << "' matches no input scalar.";
+      return false;
+    }
+  }
+  return true;
+}
+
+TEST(GroupBy, One) {
+  {
+    auto table =
+        TableFromJSON(schema({field("argument", int64()), field("key", int64())}), {R"([
+    [99,  1],
+    [99,  1]
+])",
+                                                                                    R"([
+    [77,  2],
+    [null,   3],
+    [null,   3]
+])",
+                                                                                    R"([
+    [null,   4],
+    [null,   4]
+])",
+                                                                                  R"([
+    [88,  null],
+    [99,  3]
+])",
+                                                                                  R"([
+    [77,  2],
+    [76, 2]
+])",
+                                                                                  R"([
+    [75, null],
+    [74,  3]
+  ])",
+                                                                                  R"([
+    [73,    null],
+    [72,    null]
+  ])"});
+
+  ASSERT_OK_AND_ASSIGN(auto aggregated_and_grouped,
+                       internal::GroupBy(
+                           {
+                               table->GetColumnByName("argument"),
+                           },
+                           {
+                               table->GetColumnByName("key"),
+                           },
+                           {
+                               {"hash_one", nullptr},
+                           },
+                           false));
+  ValidateOutput(aggregated_and_grouped);
+  SortBy({"key_0"}, &aggregated_and_grouped);
+
+  AssertDatumsEqual(ArrayFromJSON(struct_({
+                                      field("hash_one", int64()),
+                                      field("key_0", int64()),
+                                  }),
+                                  R"([
+      [99, 1],
+      [77, 2],
+      [null,  3],
+      [null,  4],
+      [88, null]
+    ])"),
+                    aggregated_and_grouped,
+                    /*verbose=*/true);
+  }
+  {
+    auto table =
+        TableFromJSON(schema({field("argument", utf8()), field("key", int64())}), {R"([
+     ["foo",  1],
+     ["foo",  1]
+ ])",
+                                                                                   R"([
+     ["bar",  2],
+     [null,   3],
+     [null,   3]
+ ])",
+                                                                                   R"([
+     [null,   4],
+     [null,   4]
+ ])",
+                                                                                   R"([
+     ["baz",  null],
+     ["foo",  3]
+ ])",
+                                                                                   R"([
+     ["bar",  2],
+     ["spam", 2]
+ ])",
+                                                                                   R"([
+     ["eggs", null],
+     ["ham",  3]
+   ])",
+                                                                                   R"([
+     ["a",    null],
+     ["b",    null]
+   ])"});
+
+    ASSERT_OK_AND_ASSIGN(auto aggregated_and_grouped,
+                         internal::GroupBy(
+                             {
+                                 table->GetColumnByName("argument"),
+                             },
+                             {
+                                 table->GetColumnByName("key"),
+                             },
+                             {
+                                 {"hash_one", nullptr},
+                             },
+                             false));
+    ValidateOutput(aggregated_and_grouped);
+    SortBy({"key_0"}, &aggregated_and_grouped);
+
+    AssertDatumsEqual(ArrayFromJSON(struct_({
+                                        field("hash_one", utf8()),
+                                        field("key_0", int64()),
+                                    }),
+                                    R"([
+       ["foo", 1],
+       ["bar", 2],
+       [null,  3],
+       [null,  4],
+       ["baz", null]
+     ])"),
+                      aggregated_and_grouped,
+                      /*verbose=*/true);
+  }
+}
+
+TEST(GroupBy, OneOnly) {
+  auto in_schema = schema({
+      field("argument0", float64()),
+      field("argument1", null()),
+      field("argument2", boolean()),
+      field("key", int64()),
+  });
+  for (bool use_exec_plan : {false, true}) {
+    for (bool use_threads : {false, true}) {
+      SCOPED_TRACE(use_threads ? "parallel/merged" : "serial");
+
+      auto table = TableFromJSON(in_schema, {R"([
+    [1.0,   null, true, 1],
+    [null,  null, true, 1]
+])",
+                                             R"([
+    [0.0,   null, false, 2],
+    [null,  null, false, 3],
+    [4.0,   null, null,  null],
+    [3.25,  null, true,  1],
+    [0.125, null, false, 2]
+])",
+                                             R"([
+    [-0.25, null, false, 2],
+    [0.75,  null, true,  null],
+    [null,  null, true,  3]
+])"});
+
+      ASSERT_OK_AND_ASSIGN(Datum aggregated_and_grouped,
+                           GroupByTest(
+                               {
+                                   table->GetColumnByName("argument0"),
+                                   table->GetColumnByName("argument1"),
+                                   table->GetColumnByName("argument2"),
+                               },
+                               {table->GetColumnByName("key")},
+                               {
+                                   {"hash_one", nullptr},
+                                   {"hash_one", nullptr},
+                                   {"hash_one", nullptr},
+                               },
+                               use_threads, use_exec_plan));
+      ValidateOutput(aggregated_and_grouped);
+      SortBy({"key_0"}, &aggregated_and_grouped);
+
+      //      AssertDatumsEqual(ArrayFromJSON(struct_({
+      //                                          field("hash_one", float64()),
+      //                                          field("hash_one", null()),
+      //                                          field("hash_one", boolean()),
+      //                                          field("key_0", int64()),
+      //                                      }),
+      //                                      R"([
+      //          [1.0,  null, true,  1],
+      //          [0.0,  null, false, 2],
+      //          [null, null, false, 3],
+      //          [4.0,  null, null,  null]
+      //        ])"),
+      //                        aggregated_and_grouped,
+      //                        /*verbose=*/true);
+
+      const auto& struct_arr = aggregated_and_grouped.array_as<StructArray>();
+      //  Check the key column
+      AssertDatumsEqual(ArrayFromJSON(int64(), "[1, 2, 3, null]"), struct_arr->field(3));
+
+      auto type_col_0 = float64();
+      auto group_one_col_0 =
+          AnyOfScalar(ArrayFromJSON(type_col_0, R"([1.0, null, 3.25])"));
+      auto group_two_col_0 =
+          AnyOfScalar(ArrayFromJSON(type_col_0, R"([0.0, 0.125, -0.25])"));
+      auto group_three_col_0 = AnyOfScalar(ArrayFromJSON(type_col_0, R"([null])"));
+      auto group_null_col_0 = AnyOfScalar(ArrayFromJSON(type_col_0, R"([4.0, 0.75])"));
+
+      //  Check values individually
+      const auto& col0 = struct_arr->field(0);
+      ASSERT_OK_AND_ASSIGN(const auto g_one, col0->GetScalar(0));
+      EXPECT_THAT(g_one, group_one_col_0);
+      ASSERT_OK_AND_ASSIGN(const auto g_two, col0->GetScalar(1));
+      EXPECT_THAT(g_two, group_two_col_0);
+      ASSERT_OK_AND_ASSIGN(const auto g_three, col0->GetScalar(2));
+      EXPECT_THAT(g_three, group_three_col_0);
+      ASSERT_OK_AND_ASSIGN(const auto g_null, col0->GetScalar(3));
+      EXPECT_THAT(g_null, group_null_col_0);

Review comment:
       ResultWith is in matchers.h: https://github.com/apache/arrow/blob/26d6e6217ff79451a3fe366bcc88293c7ae67417/cpp/src/arrow/testing/matchers.h#L250-L254

##########
File path: cpp/src/arrow/compute/kernels/hash_aggregate.cc
##########
@@ -2451,6 +2451,92 @@ Result<std::unique_ptr<KernelState>> GroupedDistinctInit(KernelContext* ctx,
   return std::move(impl);
 }
 
+// ----------------------------------------------------------------------
+// One implementation
+
+struct GroupedOneImpl : public GroupedAggregator {
+  Status Init(ExecContext* ctx, const std::vector<ValueDescr>&,
+              const FunctionOptions* options) override {
+    ctx_ = ctx;
+    pool_ = ctx->memory_pool();
+    return Status::OK();
+  }
+
+  Status Resize(int64_t new_num_groups) override {
+    num_groups_ = new_num_groups;
+    return Status::OK();
+  }
+
+  Status Consume(const ExecBatch& batch) override {
+    ARROW_ASSIGN_OR_RAISE(std::ignore, grouper_->Consume(batch));
+    return Status::OK();
+  }
+
+  Status Merge(GroupedAggregator&& raw_other,
+               const ArrayData& group_id_mapping) override {
+    auto other = checked_cast<GroupedOneImpl*>(&raw_other);
+
+    // Get (value, group_id) pairs, then translate the group IDs and consume them
+    // ourselves
+    ARROW_ASSIGN_OR_RAISE(auto uniques, other->grouper_->GetUniques());
+    ARROW_ASSIGN_OR_RAISE(auto remapped_g,
+                          AllocateBuffer(uniques.length * sizeof(uint32_t), pool_));
+
+    const auto* g_mapping = group_id_mapping.GetValues<uint32_t>(1);
+    const auto* other_g = uniques[1].array()->GetValues<uint32_t>(1);
+    auto* g = reinterpret_cast<uint32_t*>(remapped_g->mutable_data());
+
+    for (int64_t i = 0; i < uniques.length; i++) {
+      g[i] = g_mapping[other_g[i]];
+    }
+    uniques.values[1] =
+        ArrayData::Make(uint32(), uniques.length, {nullptr, std::move(remapped_g)});
+
+    return Consume(std::move(uniques));
+  }
+
+  Result<Datum> Finalize() override {

Review comment:
       Hash aggregates can be executed in parallel:

   - `Consume` takes an input batch and updates local state
   - `Merge` takes two local states and combines them
   - `Finalize` takes a local state and produces the output array
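
   Schematically (a sketch with hypothetical names; the real orchestration lives in the exec plan, not in the kernel):

   ```cpp
   Status RunHashAggregate(std::vector<std::unique_ptr<GroupedAggregator>>& states,
                           const std::vector<std::vector<ExecBatch>>& batches,
                           const std::vector<std::shared_ptr<ArrayData>>& mappings,
                           Datum* out) {
     // 1. Each thread consumes its own batches into a thread-local state.
     for (size_t t = 0; t < states.size(); ++t) {
       for (const ExecBatch& batch : batches[t]) {
         RETURN_NOT_OK(states[t]->Consume(batch));
       }
     }
     // 2. Local states are folded into states[0]; mappings[t] translates
     //    thread t's group ids into thread 0's numbering.
     for (size_t t = 1; t < states.size(); ++t) {
       RETURN_NOT_OK(states[0]->Merge(std::move(*states[t]), *mappings[t]));
     }
     // 3. The merged state emits the single output array.
     ARROW_ASSIGN_OR_RAISE(*out, states[0]->Finalize());
     return Status::OK();
   }
   ```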

##########
File path: cpp/src/arrow/compute/kernels/hash_aggregate_test.cc
##########
@@ -2460,6 +2460,476 @@ TEST(GroupBy, Distinct) {
   }
 }
 
+TEST(GroupBy, One) {

Review comment:
       I think we can remove this, and we can consolidate test cases to be more compact.
   
   We can have one test for all the numeric types ("OneTypes", though maybe we should rename it "OneNumericTypes" or something?), then one test for all the "misc" types (write out one large input for null, boolean, decimal128, decimal256, fixed size binary), and one test for all the binary types (iterate through binary/large binary/string/large string; see the sketch below).
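
   A sketch of the binary-types loop (input and expected JSON elided):

   ```cpp
   for (const auto& type : {binary(), large_binary(), utf8(), large_utf8()}) {
     SCOPED_TRACE(type->ToString());
     auto table = TableFromJSON(
         schema({field("argument", type), field("key", int64())}),
         {/* same JSON rows for every type */});
     ASSERT_OK_AND_ASSIGN(
         auto aggregated_and_grouped,
         internal::GroupBy({table->GetColumnByName("argument")},
                           {table->GetColumnByName("key")},
                           {{"hash_one", nullptr}},
                           /*use_threads=*/false));
     // ...compare against an expected struct array built with the same `type`
   }
   ```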

##########
File path: cpp/src/arrow/compute/kernels/hash_aggregate.cc
##########
@@ -2451,6 +2451,333 @@ Result<std::unique_ptr<KernelState>> GroupedDistinctInit(KernelContext* ctx,
   return std::move(impl);
 }
 
+// ----------------------------------------------------------------------
+// One implementation
+
+template <typename Type, typename Enable = void>
+struct GroupedOneImpl final : public GroupedAggregator {
+  using CType = typename TypeTraits<Type>::CType;
+  using GetSet = GroupedValueTraits<Type>;
+
+  Status Init(ExecContext* ctx, const std::vector<ValueDescr>&,
+              const FunctionOptions* options) override {
+    // out_type_ initialized by GroupedOneInit
+    ones_ = TypedBufferBuilder<CType>(ctx->memory_pool());
+    has_one_ = TypedBufferBuilder<bool>(ctx->memory_pool());
+    has_value_ = TypedBufferBuilder<bool>(ctx->memory_pool());
+    return Status::OK();
+  }
+
+  Status Resize(int64_t new_num_groups) override {
+    auto added_groups = new_num_groups - num_groups_;
+    num_groups_ = new_num_groups;
+    RETURN_NOT_OK(ones_.Append(added_groups, static_cast<CType>(0)));
+    RETURN_NOT_OK(has_one_.Append(added_groups, false));
+    RETURN_NOT_OK(has_value_.Append(added_groups, false));
+    return Status::OK();
+  }
+
+  Status Consume(const ExecBatch& batch) override {
+    auto raw_ones_ = ones_.mutable_data();
+
+    return VisitGroupedValues<Type>(
+        batch,
+        [&](uint32_t g, CType val) -> Status {
+          if (!bit_util::GetBit(has_one_.data(), g)) {
+            GetSet::Set(raw_ones_, g, val);
+            bit_util::SetBit(has_one_.mutable_data(), g);
+            bit_util::SetBit(has_value_.mutable_data(), g);
+          }
+          return Status::OK();
+        },
+        [&](uint32_t g) -> Status {
+          bit_util::SetBit(has_one_.mutable_data(), g);

Review comment:
       Hmm, maybe we don't want this? That is, we could remove this and "bias" the kernel towards not returning null.
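
   i.e. a sketch of `Consume` where nulls no longer claim the slot, so a later non-null value can still fill it (at which point `has_one_` and `has_value_` coincide):

   ```cpp
   return VisitGroupedValues<Type>(
       batch,
       [&](uint32_t g, CType val) -> Status {
         if (!bit_util::GetBit(has_one_.data(), g)) {
           GetSet::Set(raw_ones_, g, val);
           bit_util::SetBit(has_one_.mutable_data(), g);
           bit_util::SetBit(has_value_.mutable_data(), g);
         }
         return Status::OK();
       },
       [](uint32_t) -> Status { return Status::OK(); });  // nulls are ignored
   ```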

##########
File path: cpp/src/arrow/compute/kernels/hash_aggregate_test.cc
##########
@@ -2460,6 +2461,558 @@ TEST(GroupBy, Distinct) {
   }
 }
 
+MATCHER_P(AnyOfScalar, arrow_array, "") {
+  for (int64_t i = 0; i < arrow_array->length(); ++i) {
+    auto scalar = arrow_array->GetScalar(i).ValueOrDie();
+    if (scalar->Equals(arg)) return true;
+  }
+  *result_listener << "Argument scalar: '" << arg->ToString()
+                   << "' matches no input scalar.";
+  return false;
+}
+
+MATCHER_P(AnyOfScalarFromUniques, unique_list, "") {
+  const auto& flatten = unique_list->Flatten().ValueOrDie();
+  const auto& offsets = std::dynamic_pointer_cast<Int32Array>(unique_list->offsets());
+
+  for (int64_t i = 0; i < arg->length(); ++i) {
+    bool match_found = false;
+    const auto group_hash_one = arg->GetScalar(i).ValueOrDie();
+    int64_t start = offsets->Value(i);
+    int64_t end = offsets->Value(i + 1);
+    for (int64_t j = start; j < end; ++j) {
+      auto s = flatten->GetScalar(j).ValueOrDie();
+      if (s->Equals(group_hash_one)) {
+        match_found = true;
+        break;
+      }
+    }
+    if (!match_found) {
+      *result_listener << "Argument scalar: '" << group_hash_one->ToString()
+                       << "' matches no input scalar.";
+      return false;
+    }
+  }
+  return true;
+}
+
+TEST(GroupBy, One) {
+  {
+    auto table =
+        TableFromJSON(schema({field("argument", int64()), field("key", int64())}), {R"([
+    [99,  1],
+    [99,  1]
+])",
+                                                                                    R"([
+    [77,  2],
+    [null,   3],
+    [null,   3]
+])",
+                                                                                    R"([
+    [null,   4],
+    [null,   4]
+])",
+                                                                                  R"([
+    [88,  null],
+    [99,  3]
+])",
+                                                                                  R"([
+    [77,  2],
+    [76, 2]
+])",
+                                                                                  R"([
+    [75, null],
+    [74,  3]
+  ])",
+                                                                                  R"([
+    [73,    null],
+    [72,    null]
+  ])"});
+
+  ASSERT_OK_AND_ASSIGN(auto aggregated_and_grouped,
+                       internal::GroupBy(
+                           {
+                               table->GetColumnByName("argument"),
+                           },
+                           {
+                               table->GetColumnByName("key"),
+                           },
+                           {
+                               {"hash_one", nullptr},
+                           },
+                           false));
+  ValidateOutput(aggregated_and_grouped);
+  SortBy({"key_0"}, &aggregated_and_grouped);
+
+  AssertDatumsEqual(ArrayFromJSON(struct_({
+                                      field("hash_one", int64()),
+                                      field("key_0", int64()),
+                                  }),
+                                  R"([
+      [99, 1],
+      [77, 2],
+      [null,  3],
+      [null,  4],
+      [88, null]
+    ])"),
+                    aggregated_and_grouped,
+                    /*verbose=*/true);
+  }
+  {
+    auto table =
+        TableFromJSON(schema({field("argument", utf8()), field("key", int64())}), {R"([
+     ["foo",  1],
+     ["foo",  1]
+ ])",
+                                                                                   R"([
+     ["bar",  2],
+     [null,   3],
+     [null,   3]
+ ])",
+                                                                                   R"([
+     [null,   4],
+     [null,   4]
+ ])",
+                                                                                   R"([
+     ["baz",  null],
+     ["foo",  3]
+ ])",
+                                                                                   R"([
+     ["bar",  2],
+     ["spam", 2]
+ ])",
+                                                                                   R"([
+     ["eggs", null],
+     ["ham",  3]
+   ])",
+                                                                                   R"([
+     ["a",    null],
+     ["b",    null]
+   ])"});
+
+    ASSERT_OK_AND_ASSIGN(auto aggregated_and_grouped,
+                         internal::GroupBy(
+                             {
+                                 table->GetColumnByName("argument"),
+                             },
+                             {
+                                 table->GetColumnByName("key"),
+                             },
+                             {
+                                 {"hash_one", nullptr},
+                             },
+                             false));
+    ValidateOutput(aggregated_and_grouped);
+    SortBy({"key_0"}, &aggregated_and_grouped);
+
+    AssertDatumsEqual(ArrayFromJSON(struct_({
+                                        field("hash_one", utf8()),
+                                        field("key_0", int64()),
+                                    }),
+                                    R"([
+       ["foo", 1],
+       ["bar", 2],
+       [null,  3],
+       [null,  4],
+       ["baz", null]
+     ])"),
+                      aggregated_and_grouped,
+                      /*verbose=*/true);
+  }
+}
+
+TEST(GroupBy, OneOnly) {
+  auto in_schema = schema({
+      field("argument0", float64()),
+      field("argument1", null()),
+      field("argument2", boolean()),
+      field("key", int64()),
+  });
+  for (bool use_exec_plan : {false, true}) {
+    for (bool use_threads : {false, true}) {
+      SCOPED_TRACE(use_threads ? "parallel/merged" : "serial");
+
+      auto table = TableFromJSON(in_schema, {R"([
+    [1.0,   null, true, 1],
+    [null,  null, true, 1]
+])",
+                                             R"([
+    [0.0,   null, false, 2],
+    [null,  null, false, 3],
+    [4.0,   null, null,  null],
+    [3.25,  null, true,  1],
+    [0.125, null, false, 2]
+])",
+                                             R"([
+    [-0.25, null, false, 2],
+    [0.75,  null, true,  null],
+    [null,  null, true,  3]
+])"});
+
+      ASSERT_OK_AND_ASSIGN(Datum aggregated_and_grouped,
+                           GroupByTest(
+                               {
+                                   table->GetColumnByName("argument0"),
+                                   table->GetColumnByName("argument1"),
+                                   table->GetColumnByName("argument2"),
+                               },
+                               {table->GetColumnByName("key")},
+                               {
+                                   {"hash_one", nullptr},
+                                   {"hash_one", nullptr},
+                                   {"hash_one", nullptr},
+                               },
+                               use_threads, use_exec_plan));
+      ValidateOutput(aggregated_and_grouped);
+      SortBy({"key_0"}, &aggregated_and_grouped);
+
+      //      AssertDatumsEqual(ArrayFromJSON(struct_({
+      //                                          field("hash_one", float64()),
+      //                                          field("hash_one", null()),
+      //                                          field("hash_one", boolean()),
+      //                                          field("key_0", int64()),
+      //                                      }),
+      //                                      R"([
+      //          [1.0,  null, true,  1],
+      //          [0.0,  null, false, 2],
+      //          [null, null, false, 3],
+      //          [4.0,  null, null,  null]
+      //        ])"),
+      //                        aggregated_and_grouped,
+      //                        /*verbose=*/true);
+
+      const auto& struct_arr = aggregated_and_grouped.array_as<StructArray>();
+      //  Check the key column
+      AssertDatumsEqual(ArrayFromJSON(int64(), "[1, 2, 3, null]"), struct_arr->field(3));
+
+      auto type_col_0 = float64();
+      auto group_one_col_0 =
+          AnyOfScalar(ArrayFromJSON(type_col_0, R"([1.0, null, 3.25])"));
+      auto group_two_col_0 =
+          AnyOfScalar(ArrayFromJSON(type_col_0, R"([0.0, 0.125, -0.25])"));
+      auto group_three_col_0 = AnyOfScalar(ArrayFromJSON(type_col_0, R"([null])"));
+      auto group_null_col_0 = AnyOfScalar(ArrayFromJSON(type_col_0, R"([4.0, 0.75])"));
+
+      //  Check values individually
+      const auto& col0 = struct_arr->field(0);
+      ASSERT_OK_AND_ASSIGN(const auto g_one, col0->GetScalar(0));
+      EXPECT_THAT(g_one, group_one_col_0);
+      ASSERT_OK_AND_ASSIGN(const auto g_two, col0->GetScalar(1));
+      EXPECT_THAT(g_two, group_two_col_0);
+      ASSERT_OK_AND_ASSIGN(const auto g_three, col0->GetScalar(2));
+      EXPECT_THAT(g_three, group_three_col_0);
+      ASSERT_OK_AND_ASSIGN(const auto g_null, col0->GetScalar(3));
+      EXPECT_THAT(g_null, group_null_col_0);

Review comment:
       I think something like `EXPECT_THAT(col0->GetScalar(0), ResultWith(AnyOfScalar(...)))` could shorten this. Also, we could make a helper function `AnyOfJSON(type, str)` which calls `AnyOfScalar(ArrayFromJSON(type, str))` for you.
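
       For illustration, a minimal sketch of such a helper, assuming the `AnyOfScalar` matcher above and the gmock/Arrow testing utilities already included by this file (`ArrayFromJSON`, `ResultWith`):

```cpp
// Sketch only: wraps AnyOfScalar(ArrayFromJSON(...)) so call sites stay short.
// AnyOfJSON(type, json) matches a scalar equal to any element of the JSON
// array parsed with the given type.
auto AnyOfJSON(const std::shared_ptr<arrow::DataType>& type,
               const std::string& array_json) {
  return AnyOfScalar(ArrayFromJSON(type, array_json));
}

// A call site then shrinks to:
//   EXPECT_THAT(col0->GetScalar(0),
//               ResultWith(AnyOfJSON(float64(), R"([1.0, null, 3.25])")));
```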

##########
File path: cpp/src/arrow/compute/kernels/hash_aggregate_test.cc
##########
@@ -2460,6 +2461,558 @@ TEST(GroupBy, Distinct) {
   }
 }
 
+MATCHER_P(AnyOfScalar, arrow_array, "") {
+  for (int64_t i = 0; i < arrow_array->length(); ++i) {
+    auto scalar = arrow_array->GetScalar(i).ValueOrDie();
+    if (scalar->Equals(arg)) return true;
+  }
+  *result_listener << "Argument scalar: '" << arg->ToString()
+                   << "' matches no input scalar.";
+  return false;
+}
+
+MATCHER_P(AnyOfScalarFromUniques, unique_list, "") {
+  const auto& flatten = unique_list->Flatten().ValueOrDie();
+  const auto& offsets = std::dynamic_pointer_cast<Int32Array>(unique_list->offsets());
+
+  for (int64_t i = 0; i < arg->length(); ++i) {
+    bool match_found = false;
+    const auto group_hash_one = arg->GetScalar(i).ValueOrDie();
+    int64_t start = offsets->Value(i);
+    int64_t end = offsets->Value(i + 1);
+    for (int64_t j = start; j < end; ++j) {
+      auto s = flatten->GetScalar(j).ValueOrDie();
+      if (s->Equals(group_hash_one)) {
+        match_found = true;
+        break;
+      }
+    }
+    if (!match_found) {
+      *result_listener << "Argument scalar: '" << group_hash_one->ToString()
+                       << "' matches no input scalar.";
+      return false;
+    }
+  }
+  return true;
+}
+
+TEST(GroupBy, One) {
+  {
+    auto table =
+        TableFromJSON(schema({field("argument", int64()), field("key", int64())}), {R"([
+    [99,  1],
+    [99,  1]
+])",
+                                                                                    R"([
+    [77,  2],
+    [null,   3],
+    [null,   3]
+])",
+                                                                                    R"([
+    [null,   4],
+    [null,   4]
+])",
+                                                                                  R"([
+    [88,  null],
+    [99,  3]
+])",
+                                                                                  R"([
+    [77,  2],
+    [76, 2]
+])",
+                                                                                  R"([
+    [75, null],
+    [74,  3]
+  ])",
+                                                                                  R"([
+    [73,    null],
+    [72,    null]
+  ])"});
+
+  ASSERT_OK_AND_ASSIGN(auto aggregated_and_grouped,
+                       internal::GroupBy(
+                           {
+                               table->GetColumnByName("argument"),
+                           },
+                           {
+                               table->GetColumnByName("key"),
+                           },
+                           {
+                               {"hash_one", nullptr},
+                           },
+                           false));
+  ValidateOutput(aggregated_and_grouped);
+  SortBy({"key_0"}, &aggregated_and_grouped);
+
+  AssertDatumsEqual(ArrayFromJSON(struct_({
+                                      field("hash_one", int64()),
+                                      field("key_0", int64()),
+                                  }),
+                                  R"([
+      [99, 1],
+      [77, 2],
+      [null,  3],
+      [null,  4],
+      [88, null]
+    ])"),
+                    aggregated_and_grouped,
+                    /*verbose=*/true);
+  }
+  {
+    auto table =
+        TableFromJSON(schema({field("argument", utf8()), field("key", int64())}), {R"([
+     ["foo",  1],
+     ["foo",  1]
+ ])",
+                                                                                   R"([
+     ["bar",  2],
+     [null,   3],
+     [null,   3]
+ ])",
+                                                                                   R"([
+     [null,   4],
+     [null,   4]
+ ])",
+                                                                                   R"([
+     ["baz",  null],
+     ["foo",  3]
+ ])",
+                                                                                   R"([
+     ["bar",  2],
+     ["spam", 2]
+ ])",
+                                                                                   R"([
+     ["eggs", null],
+     ["ham",  3]
+   ])",
+                                                                                   R"([
+     ["a",    null],
+     ["b",    null]
+   ])"});
+
+    ASSERT_OK_AND_ASSIGN(auto aggregated_and_grouped,
+                         internal::GroupBy(
+                             {
+                                 table->GetColumnByName("argument"),
+                             },
+                             {
+                                 table->GetColumnByName("key"),
+                             },
+                             {
+                                 {"hash_one", nullptr},
+                             },
+                             false));
+    ValidateOutput(aggregated_and_grouped);
+    SortBy({"key_0"}, &aggregated_and_grouped);
+
+    AssertDatumsEqual(ArrayFromJSON(struct_({
+                                        field("hash_one", utf8()),
+                                        field("key_0", int64()),
+                                    }),
+                                    R"([
+       ["foo", 1],
+       ["bar", 2],
+       [null,  3],
+       [null,  4],
+       ["baz", null]
+     ])"),
+                      aggregated_and_grouped,
+                      /*verbose=*/true);
+  }
+}
+
+TEST(GroupBy, OneOnly) {
+  auto in_schema = schema({
+      field("argument0", float64()),
+      field("argument1", null()),
+      field("argument2", boolean()),
+      field("key", int64()),
+  });
+  for (bool use_exec_plan : {false, true}) {
+    for (bool use_threads : {false, true}) {
+      SCOPED_TRACE(use_threads ? "parallel/merged" : "serial");
+
+      auto table = TableFromJSON(in_schema, {R"([
+    [1.0,   null, true, 1],
+    [null,  null, true, 1]
+])",
+                                             R"([
+    [0.0,   null, false, 2],
+    [null,  null, false, 3],
+    [4.0,   null, null,  null],
+    [3.25,  null, true,  1],
+    [0.125, null, false, 2]
+])",
+                                             R"([
+    [-0.25, null, false, 2],
+    [0.75,  null, true,  null],
+    [null,  null, true,  3]
+])"});
+
+      ASSERT_OK_AND_ASSIGN(Datum aggregated_and_grouped,
+                           GroupByTest(
+                               {
+                                   table->GetColumnByName("argument0"),
+                                   table->GetColumnByName("argument1"),
+                                   table->GetColumnByName("argument2"),
+                               },
+                               {table->GetColumnByName("key")},
+                               {
+                                   {"hash_one", nullptr},
+                                   {"hash_one", nullptr},
+                                   {"hash_one", nullptr},
+                               },
+                               use_threads, use_exec_plan));
+      ValidateOutput(aggregated_and_grouped);
+      SortBy({"key_0"}, &aggregated_and_grouped);
+
+      //      AssertDatumsEqual(ArrayFromJSON(struct_({
+      //                                          field("hash_one", float64()),
+      //                                          field("hash_one", null()),
+      //                                          field("hash_one", boolean()),
+      //                                          field("key_0", int64()),
+      //                                      }),
+      //                                      R"([
+      //          [1.0,  null, true,  1],
+      //          [0.0,  null, false, 2],
+      //          [null, null, false, 3],
+      //          [4.0,  null, null,  null]
+      //        ])"),
+      //                        aggregated_and_grouped,
+      //                        /*verbose=*/true);
+
+      const auto& struct_arr = aggregated_and_grouped.array_as<StructArray>();
+      //  Check the key column
+      AssertDatumsEqual(ArrayFromJSON(int64(), "[1, 2, 3, null]"), struct_arr->field(3));
+
+      auto type_col_0 = float64();
+      auto group_one_col_0 =
+          AnyOfScalar(ArrayFromJSON(type_col_0, R"([1.0, null, 3.25])"));
+      auto group_two_col_0 =
+          AnyOfScalar(ArrayFromJSON(type_col_0, R"([0.0, 0.125, -0.25])"));
+      auto group_three_col_0 = AnyOfScalar(ArrayFromJSON(type_col_0, R"([null])"));
+      auto group_null_col_0 = AnyOfScalar(ArrayFromJSON(type_col_0, R"([4.0, 0.75])"));
+
+      //  Check values individually
+      const auto& col0 = struct_arr->field(0);
+      ASSERT_OK_AND_ASSIGN(const auto g_one, col0->GetScalar(0));
+      EXPECT_THAT(g_one, group_one_col_0);
+      ASSERT_OK_AND_ASSIGN(const auto g_two, col0->GetScalar(1));
+      EXPECT_THAT(g_two, group_two_col_0);
+      ASSERT_OK_AND_ASSIGN(const auto g_three, col0->GetScalar(2));
+      EXPECT_THAT(g_three, group_three_col_0);
+      ASSERT_OK_AND_ASSIGN(const auto g_null, col0->GetScalar(3));
+      EXPECT_THAT(g_null, group_null_col_0);
+
+      CountOptions all(CountOptions::ALL);
+      ASSERT_OK_AND_ASSIGN(
+          auto distinct_out,
+          internal::GroupBy(
+              {
+                  table->GetColumnByName("argument0"),
+                  table->GetColumnByName("argument1"),
+                  table->GetColumnByName("argument2"),
+              },
+              {
+                  table->GetColumnByName("key"),
+              },
+              {{"hash_distinct", &all}, {"hash_distinct", &all}, {"hash_distinct", &all}},
+              use_threads));
+      ValidateOutput(distinct_out);
+      SortBy({"key_0"}, &distinct_out);
+
+      const auto& struct_arr_distinct = distinct_out.array_as<StructArray>();
+      for (int64_t col = 0; col < struct_arr_distinct->length() - 1; ++col) {
+        const auto matcher = AnyOfScalarFromUniques(
+            checked_pointer_cast<ListArray>(struct_arr_distinct->field(col)));
+        EXPECT_THAT(struct_arr->field(col), matcher);
+      }

Review comment:
       We can use other kernels, but I'm not sure this is any cleaner. The other approach is repetitive, but clear about what's going on. This requires a lot of thought to see what's happening.







[GitHub] [arrow] dhruv9vats commented on a change in pull request #12368: ARROW-13993: [C++] [Compute] Add hash_one aggregate function

Posted by GitBox <gi...@apache.org>.
dhruv9vats commented on a change in pull request #12368:
URL: https://github.com/apache/arrow/pull/12368#discussion_r806444672



##########
File path: cpp/src/arrow/compute/kernels/hash_aggregate_test.cc
##########
@@ -2460,6 +2461,558 @@ TEST(GroupBy, Distinct) {
   }
 }
 
+MATCHER_P(AnyOfScalar, arrow_array, "") {
+  for (int64_t i = 0; i < arrow_array->length(); ++i) {
+    auto scalar = arrow_array->GetScalar(i).ValueOrDie();
+    if (scalar->Equals(arg)) return true;
+  }
+  *result_listener << "Argument scalar: '" << arg->ToString()
+                   << "' matches no input scalar.";
+  return false;
+}
+
+MATCHER_P(AnyOfScalarFromUniques, unique_list, "") {
+  const auto& flatten = unique_list->Flatten().ValueOrDie();
+  const auto& offsets = std::dynamic_pointer_cast<Int32Array>(unique_list->offsets());
+
+  for (int64_t i = 0; i < arg->length(); ++i) {
+    bool match_found = false;
+    const auto group_hash_one = arg->GetScalar(i).ValueOrDie();
+    int64_t start = offsets->Value(i);
+    int64_t end = offsets->Value(i + 1);
+    for (int64_t j = start; j < end; ++j) {
+      auto s = flatten->GetScalar(j).ValueOrDie();
+      if (s->Equals(group_hash_one)) {
+        match_found = true;
+        break;
+      }
+    }
+    if (!match_found) {
+      *result_listener << "Argument scalar: '" << group_hash_one->ToString()
+                       << "' matches no input scalar.";
+      return false;
+    }
+  }
+  return true;
+}
+
+TEST(GroupBy, One) {
+  {
+    auto table =
+        TableFromJSON(schema({field("argument", int64()), field("key", int64())}), {R"([
+    [99,  1],
+    [99,  1]
+])",
+                                                                                    R"([
+    [77,  2],
+    [null,   3],
+    [null,   3]
+])",
+                                                                                    R"([
+    [null,   4],
+    [null,   4]
+])",
+                                                                                  R"([
+    [88,  null],
+    [99,  3]
+])",
+                                                                                  R"([
+    [77,  2],
+    [76, 2]
+])",
+                                                                                  R"([
+    [75, null],
+    [74,  3]
+  ])",
+                                                                                  R"([
+    [73,    null],
+    [72,    null]
+  ])"});
+
+  ASSERT_OK_AND_ASSIGN(auto aggregated_and_grouped,
+                       internal::GroupBy(
+                           {
+                               table->GetColumnByName("argument"),
+                           },
+                           {
+                               table->GetColumnByName("key"),
+                           },
+                           {
+                               {"hash_one", nullptr},
+                           },
+                           false));
+  ValidateOutput(aggregated_and_grouped);
+  SortBy({"key_0"}, &aggregated_and_grouped);
+
+  AssertDatumsEqual(ArrayFromJSON(struct_({
+                                      field("hash_one", int64()),
+                                      field("key_0", int64()),
+                                  }),
+                                  R"([
+      [99, 1],
+      [77, 2],
+      [null,  3],
+      [null,  4],
+      [88, null]
+    ])"),
+                    aggregated_and_grouped,
+                    /*verbose=*/true);
+  }
+  {
+    auto table =
+        TableFromJSON(schema({field("argument", utf8()), field("key", int64())}), {R"([
+     ["foo",  1],
+     ["foo",  1]
+ ])",
+                                                                                   R"([
+     ["bar",  2],
+     [null,   3],
+     [null,   3]
+ ])",
+                                                                                   R"([
+     [null,   4],
+     [null,   4]
+ ])",
+                                                                                   R"([
+     ["baz",  null],
+     ["foo",  3]
+ ])",
+                                                                                   R"([
+     ["bar",  2],
+     ["spam", 2]
+ ])",
+                                                                                   R"([
+     ["eggs", null],
+     ["ham",  3]
+   ])",
+                                                                                   R"([
+     ["a",    null],
+     ["b",    null]
+   ])"});
+
+    ASSERT_OK_AND_ASSIGN(auto aggregated_and_grouped,
+                         internal::GroupBy(
+                             {
+                                 table->GetColumnByName("argument"),
+                             },
+                             {
+                                 table->GetColumnByName("key"),
+                             },
+                             {
+                                 {"hash_one", nullptr},
+                             },
+                             false));
+    ValidateOutput(aggregated_and_grouped);
+    SortBy({"key_0"}, &aggregated_and_grouped);
+
+    AssertDatumsEqual(ArrayFromJSON(struct_({
+                                        field("hash_one", utf8()),
+                                        field("key_0", int64()),
+                                    }),
+                                    R"([
+       ["foo", 1],
+       ["bar", 2],
+       [null,  3],
+       [null,  4],
+       ["baz", null]
+     ])"),
+                      aggregated_and_grouped,
+                      /*verbose=*/true);
+  }
+}
+
+TEST(GroupBy, OneOnly) {
+  auto in_schema = schema({
+      field("argument0", float64()),
+      field("argument1", null()),
+      field("argument2", boolean()),
+      field("key", int64()),
+  });
+  for (bool use_exec_plan : {false, true}) {
+    for (bool use_threads : {false, true}) {
+      SCOPED_TRACE(use_threads ? "parallel/merged" : "serial");
+
+      auto table = TableFromJSON(in_schema, {R"([
+    [1.0,   null, true, 1],
+    [null,  null, true, 1]
+])",
+                                             R"([
+    [0.0,   null, false, 2],
+    [null,  null, false, 3],
+    [4.0,   null, null,  null],
+    [3.25,  null, true,  1],
+    [0.125, null, false, 2]
+])",
+                                             R"([
+    [-0.25, null, false, 2],
+    [0.75,  null, true,  null],
+    [null,  null, true,  3]
+])"});
+
+      ASSERT_OK_AND_ASSIGN(Datum aggregated_and_grouped,
+                           GroupByTest(
+                               {
+                                   table->GetColumnByName("argument0"),
+                                   table->GetColumnByName("argument1"),
+                                   table->GetColumnByName("argument2"),
+                               },
+                               {table->GetColumnByName("key")},
+                               {
+                                   {"hash_one", nullptr},
+                                   {"hash_one", nullptr},
+                                   {"hash_one", nullptr},
+                               },
+                               use_threads, use_exec_plan));
+      ValidateOutput(aggregated_and_grouped);
+      SortBy({"key_0"}, &aggregated_and_grouped);
+
+      //      AssertDatumsEqual(ArrayFromJSON(struct_({
+      //                                          field("hash_one", float64()),
+      //                                          field("hash_one", null()),
+      //                                          field("hash_one", boolean()),
+      //                                          field("key_0", int64()),
+      //                                      }),
+      //                                      R"([
+      //          [1.0,  null, true,  1],
+      //          [0.0,  null, false, 2],
+      //          [null, null, false, 3],
+      //          [4.0,  null, null,  null]
+      //        ])"),
+      //                        aggregated_and_grouped,
+      //                        /*verbose=*/true);
+
+      const auto& struct_arr = aggregated_and_grouped.array_as<StructArray>();
+      //  Check the key column
+      AssertDatumsEqual(ArrayFromJSON(int64(), "[1, 2, 3, null]"), struct_arr->field(3));
+
+      auto type_col_0 = float64();
+      auto group_one_col_0 =
+          AnyOfScalar(ArrayFromJSON(type_col_0, R"([1.0, null, 3.25])"));
+      auto group_two_col_0 =
+          AnyOfScalar(ArrayFromJSON(type_col_0, R"([0.0, 0.125, -0.25])"));
+      auto group_three_col_0 = AnyOfScalar(ArrayFromJSON(type_col_0, R"([null])"));
+      auto group_null_col_0 = AnyOfScalar(ArrayFromJSON(type_col_0, R"([4.0, 0.75])"));
+
+      //  Check values individually
+      const auto& col0 = struct_arr->field(0);
+      ASSERT_OK_AND_ASSIGN(const auto g_one, col0->GetScalar(0));
+      EXPECT_THAT(g_one, group_one_col_0);
+      ASSERT_OK_AND_ASSIGN(const auto g_two, col0->GetScalar(1));
+      EXPECT_THAT(g_two, group_two_col_0);
+      ASSERT_OK_AND_ASSIGN(const auto g_three, col0->GetScalar(2));
+      EXPECT_THAT(g_three, group_three_col_0);
+      ASSERT_OK_AND_ASSIGN(const auto g_null, col0->GetScalar(3));
+      EXPECT_THAT(g_null, group_null_col_0);
+
+      CountOptions all(CountOptions::ALL);
+      ASSERT_OK_AND_ASSIGN(
+          auto distinct_out,
+          internal::GroupBy(
+              {
+                  table->GetColumnByName("argument0"),
+                  table->GetColumnByName("argument1"),
+                  table->GetColumnByName("argument2"),
+              },
+              {
+                  table->GetColumnByName("key"),
+              },
+              {{"hash_distinct", &all}, {"hash_distinct", &all}, {"hash_distinct", &all}},
+              use_threads));
+      ValidateOutput(distinct_out);
+      SortBy({"key_0"}, &distinct_out);
+
+      const auto& struct_arr_distinct = distinct_out.array_as<StructArray>();
+      for (int64_t col = 0; col < struct_arr_distinct->length() - 1; ++col) {

Review comment:
       ```suggestion
         for (int64_t col = 0; col < struct_arr_distinct->num_fields() - 1; ++col) {
       ```
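
       (Note: `length()` on a `StructArray` is its row count, while `num_fields()` is its number of child columns; since the loop walks columns, `num_fields()` is the bound it actually wants, as the suggestion above indicates.)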










[GitHub] [arrow] dhruv9vats commented on a change in pull request #12368: ARROW-13993: [C++] [Compute] Add hash_one aggregate function

Posted by GitBox <gi...@apache.org>.
dhruv9vats commented on a change in pull request #12368:
URL: https://github.com/apache/arrow/pull/12368#discussion_r801678902



##########
File path: cpp/src/arrow/compute/kernels/hash_aggregate.cc
##########
@@ -2451,6 +2451,92 @@ Result<std::unique_ptr<KernelState>> GroupedDistinctInit(KernelContext* ctx,
   return std::move(impl);
 }
 
+// ----------------------------------------------------------------------
+// One implementation
+
+struct GroupedOneImpl : public GroupedAggregator {
+  Status Init(ExecContext* ctx, const std::vector<ValueDescr>&,
+              const FunctionOptions* options) override {
+    ctx_ = ctx;
+    pool_ = ctx->memory_pool();
+    return Status::OK();
+  }
+
+  Status Resize(int64_t new_num_groups) override {
+    num_groups_ = new_num_groups;
+    return Status::OK();
+  }
+
+  Status Consume(const ExecBatch& batch) override {
+    ARROW_ASSIGN_OR_RAISE(std::ignore, grouper_->Consume(batch));
+    return Status::OK();
+  }
+
+  Status Merge(GroupedAggregator&& raw_other,
+               const ArrayData& group_id_mapping) override {
+    auto other = checked_cast<GroupedOneImpl*>(&raw_other);
+
+    // Get (value, group_id) pairs, then translate the group IDs and consume them
+    // ourselves
+    ARROW_ASSIGN_OR_RAISE(auto uniques, other->grouper_->GetUniques());
+    ARROW_ASSIGN_OR_RAISE(auto remapped_g,
+                          AllocateBuffer(uniques.length * sizeof(uint32_t), pool_));
+
+    const auto* g_mapping = group_id_mapping.GetValues<uint32_t>(1);
+    const auto* other_g = uniques[1].array()->GetValues<uint32_t>(1);
+    auto* g = reinterpret_cast<uint32_t*>(remapped_g->mutable_data());
+
+    for (int64_t i = 0; i < uniques.length; i++) {
+      g[i] = g_mapping[other_g[i]];
+    }
+    uniques.values[1] =
+        ArrayData::Make(uint32(), uniques.length, {nullptr, std::move(remapped_g)});
+
+    return Consume(std::move(uniques));
+  }
+
+  Result<Datum> Finalize() override {

Review comment:
       Most of the stuff is carryover from `DistinctCount`/`Distinct` as I don't fully understand their function (such as `Merge`). Have added a basic test to confirm this is what we want. Also, have taken a naive `ArrayBuilder` approach to get the basics working. What would some better approaches be?
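
       A compressed sketch of one direction (the per-group slot layout a later revision in this thread adopts; fixed-width types only, binary/null specializations omitted):

```cpp
// Sketch: track one value per group directly instead of keeping a Grouper.
// ones_ holds the selected value for each group; has_one_ marks that a slot
// was claimed (even by a null); has_value_ marks that the slot is non-null.
template <typename CType>
struct OneState {
  Status Resize(int64_t new_num_groups) {
    auto added = new_num_groups - num_groups_;
    num_groups_ = new_num_groups;
    RETURN_NOT_OK(ones_.Append(added, CType{}));
    RETURN_NOT_OK(has_one_.Append(added, false));
    return has_value_.Append(added, false);
  }

  void Observe(uint32_t g, CType val, bool is_valid) {
    if (bit_util::GetBit(has_one_.data(), g)) return;  // already picked one
    bit_util::SetBit(has_one_.mutable_data(), g);
    if (is_valid) {
      ones_.mutable_data()[g] = val;
      bit_util::SetBit(has_value_.mutable_data(), g);
    }
  }

  int64_t num_groups_ = 0;
  TypedBufferBuilder<CType> ones_;
  TypedBufferBuilder<bool> has_one_, has_value_;
};
```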







[GitHub] [arrow] dhruv9vats commented on a change in pull request #12368: ARROW-13993: [C++] [Compute] Add hash_one aggregate function

Posted by GitBox <gi...@apache.org>.
dhruv9vats commented on a change in pull request #12368:
URL: https://github.com/apache/arrow/pull/12368#discussion_r807884279



##########
File path: cpp/src/arrow/compute/kernels/hash_aggregate.cc
##########
@@ -2451,6 +2451,333 @@ Result<std::unique_ptr<KernelState>> GroupedDistinctInit(KernelContext* ctx,
   return std::move(impl);
 }
 
+// ----------------------------------------------------------------------
+// One implementation
+
+template <typename Type, typename Enable = void>
+struct GroupedOneImpl final : public GroupedAggregator {
+  using CType = typename TypeTraits<Type>::CType;
+  using GetSet = GroupedValueTraits<Type>;
+
+  Status Init(ExecContext* ctx, const std::vector<ValueDescr>&,
+              const FunctionOptions* options) override {
+    // out_type_ initialized by GroupedOneInit
+    ones_ = TypedBufferBuilder<CType>(ctx->memory_pool());
+    has_one_ = TypedBufferBuilder<bool>(ctx->memory_pool());
+    has_value_ = TypedBufferBuilder<bool>(ctx->memory_pool());
+    return Status::OK();
+  }
+
+  Status Resize(int64_t new_num_groups) override {
+    auto added_groups = new_num_groups - num_groups_;
+    num_groups_ = new_num_groups;
+    RETURN_NOT_OK(ones_.Append(added_groups, static_cast<CType>(0)));
+    RETURN_NOT_OK(has_one_.Append(added_groups, false));
+    RETURN_NOT_OK(has_value_.Append(added_groups, false));
+    return Status::OK();
+  }
+
+  Status Consume(const ExecBatch& batch) override {
+    auto raw_ones_ = ones_.mutable_data();
+
+    return VisitGroupedValues<Type>(
+        batch,
+        [&](uint32_t g, CType val) -> Status {
+          if (!bit_util::GetBit(has_one_.data(), g)) {
+            GetSet::Set(raw_ones_, g, val);
+            bit_util::SetBit(has_one_.mutable_data(), g);
+            bit_util::SetBit(has_value_.mutable_data(), g);
+          }
+          return Status::OK();
+        },
+        [&](uint32_t g) -> Status {
+          bit_util::SetBit(has_one_.mutable_data(), g);
+          return Status::OK();
+        });
+  }
+
+  Status Merge(GroupedAggregator&& raw_other,
+               const ArrayData& group_id_mapping) override {
+    auto other = checked_cast<GroupedOneImpl*>(&raw_other);
+
+    auto raw_ones = ones_.mutable_data();
+    auto other_raw_ones = other->ones_.mutable_data();
+
+    auto g = group_id_mapping.GetValues<uint32_t>(1);
+    for (uint32_t other_g = 0; static_cast<int64_t>(other_g) < group_id_mapping.length;
+         ++other_g, ++g) {
+      if (!bit_util::GetBit(has_one_.data(), *g)) {
+        if (bit_util::GetBit(other->has_value_.data(), other_g)) {
+          GetSet::Set(raw_ones, *g, GetSet::Get(other_raw_ones, other_g));
+          bit_util::SetBit(has_value_.mutable_data(), *g);
+        }
+        bit_util::SetBit(has_one_.mutable_data(), *g);
+      }
+    }
+
+    return Status::OK();
+  }
+
+  Result<Datum> Finalize() override {
+    ARROW_ASSIGN_OR_RAISE(auto null_bitmap, has_value_.Finish());
+    ARROW_ASSIGN_OR_RAISE(auto data, ones_.Finish());
+    return ArrayData::Make(out_type_, num_groups_,
+                           {std::move(null_bitmap), std::move(data)});
+  }
+
+  std::shared_ptr<DataType> out_type() const override { return out_type_; }
+
+  int64_t num_groups_;
+  TypedBufferBuilder<CType> ones_;
+  TypedBufferBuilder<bool> has_one_, has_value_;
+  std::shared_ptr<DataType> out_type_;
+};
+
+struct GroupedNullOneImpl : public GroupedAggregator {
+  Status Init(ExecContext* ctx, const std::vector<ValueDescr>&,
+              const FunctionOptions* options) override {
+    return Status::OK();
+  }
+
+  Status Resize(int64_t new_num_groups) override {
+    num_groups_ = new_num_groups;
+    return Status::OK();
+  }
+
+  Status Consume(const ExecBatch& batch) override { return Status::OK(); }
+
+  Status Merge(GroupedAggregator&& raw_other,
+               const ArrayData& group_id_mapping) override {
+    return Status::OK();
+  }
+
+  Result<Datum> Finalize() override {
+    return ArrayData::Make(null(), num_groups_, {nullptr}, num_groups_);
+  }
+
+  std::shared_ptr<DataType> out_type() const override { return null(); }
+
+  int64_t num_groups_;
+};
+
+template <typename Type>
+struct GroupedOneImpl<Type, enable_if_t<is_base_binary_type<Type>::value ||
+                                        std::is_same<Type, FixedSizeBinaryType>::value>>
+    final : public GroupedAggregator {
+  using Allocator = arrow::stl::allocator<char>;
+  using StringType = std::basic_string<char, std::char_traits<char>, Allocator>;
+
+  Status Init(ExecContext* ctx, const std::vector<ValueDescr>&,
+              const FunctionOptions* options) override {
+    ctx_ = ctx;
+    allocator_ = Allocator(ctx->memory_pool());
+    // out_type_ initialized by GroupedOneInit
+    has_value_ = TypedBufferBuilder<bool>(ctx->memory_pool());
+    return Status::OK();
+  }
+
+  Status Resize(int64_t new_num_groups) override {
+    auto added_groups = new_num_groups - num_groups_;
+    DCHECK_GE(added_groups, 0);
+    num_groups_ = new_num_groups;
+    ones_.resize(new_num_groups);
+    RETURN_NOT_OK(has_one_.Append(added_groups, false));
+    RETURN_NOT_OK(has_value_.Append(added_groups, false));
+    return Status::OK();
+  }
+
+  Status Consume(const ExecBatch& batch) override {
+    return VisitGroupedValues<Type>(
+        batch,
+        [&](uint32_t g, util::string_view val) -> Status {
+          if (!bit_util::GetBit(has_one_.data(), g)) {
+            ones_[g].emplace(val.data(), val.size(), allocator_);
+            bit_util::SetBit(has_one_.mutable_data(), g);
+            bit_util::SetBit(has_value_.mutable_data(), g);
+          }
+          return Status::OK();
+        },
+        [&](uint32_t g) -> Status {
+          // as has_one_ is set, has_value_ will never be set, resulting in null
+          bit_util::SetBit(has_one_.mutable_data(), g);
+          return Status::OK();
+        });
+  }
+
+  Status Merge(GroupedAggregator&& raw_other,
+               const ArrayData& group_id_mapping) override {
+    auto other = checked_cast<GroupedOneImpl*>(&raw_other);
+    auto g = group_id_mapping.GetValues<uint32_t>(1);
+    for (uint32_t other_g = 0; static_cast<int64_t>(other_g) < group_id_mapping.length;
+         ++other_g, ++g) {
+      if (!bit_util::GetBit(has_one_.data(), *g)) {
+        if (bit_util::GetBit(other->has_value_.data(), other_g)) {
+          ones_[*g] = std::move(other->ones_[other_g]);
+          bit_util::SetBit(has_value_.mutable_data(), *g);
+        }
+        bit_util::SetBit(has_one_.mutable_data(), *g);
+      }
+    }
+    return Status::OK();
+  }
+
+  Result<Datum> Finalize() override {
+    ARROW_ASSIGN_OR_RAISE(auto null_bitmap, has_value_.Finish());
+    auto ones =
+        ArrayData::Make(out_type(), num_groups_, {std::move(null_bitmap), nullptr});
+    RETURN_NOT_OK(MakeOffsetsValues(ones.get(), ones_));
+    return ones;
+  }
+
+  template <typename T = Type>
+  enable_if_base_binary<T, Status> MakeOffsetsValues(

Review comment:
       What would this look like? Would it go in some sort of helper file?
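
   One hypothetical shape (file and function names here are illustrative, not from the codebase): a shared internal header that the binary-type aggregators could all include, declaring the re-packing step once:

```cpp
// Hypothetical header, e.g. arrow/compute/kernels/hash_aggregate_internal.h.
#include <vector>

#include "arrow/array/data.h"
#include "arrow/compute/exec.h"
#include "arrow/status.h"
#include "arrow/util/optional.h"

namespace arrow {
namespace compute {
namespace internal {

// Rebuilds the offsets and values buffers of a var-length binary ArrayData
// from per-group string slots, so each aggregator stops duplicating it.
template <typename Type, typename StringType>
Status MakeOffsetsValues(ExecContext* ctx, ArrayData* array,
                         const std::vector<util::optional<StringType>>& ones);

}  // namespace internal
}  // namespace compute
}  // namespace arrow
```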










[GitHub] [arrow] dhruv9vats commented on a change in pull request #12368: ARROW-13993: [C++] [Compute] Add hash_one aggregate function

Posted by GitBox <gi...@apache.org>.
dhruv9vats commented on a change in pull request #12368:
URL: https://github.com/apache/arrow/pull/12368#discussion_r808001340



##########
File path: cpp/src/arrow/compute/kernels/hash_aggregate_test.cc
##########
@@ -2460,6 +2461,315 @@ TEST(GroupBy, Distinct) {
   }
 }
 
+MATCHER_P2(AnyOfJSON, type, array_json, "") {

Review comment:
       Trying to mitigate this right now.







[GitHub] [arrow] dhruv9vats commented on a change in pull request #12368: ARROW-13993: [C++] [Compute] Add hash_one aggregate function

Posted by GitBox <gi...@apache.org>.
dhruv9vats commented on a change in pull request #12368:
URL: https://github.com/apache/arrow/pull/12368#discussion_r808479628



##########
File path: cpp/src/arrow/testing/matchers.h
##########
@@ -61,6 +61,65 @@ class PointeesEqualMatcher {
 // Useful in conjunction with other googletest matchers.
 inline PointeesEqualMatcher PointeesEqual() { return {}; }
 
+class AnyOfJSONMatcher {

Review comment:
       Have tried to match this to the other matcher definitions in the file. Will this suffice?
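
       For reference, a sketch of the non-macro pattern (details may differ from what was merged): a small class with a templated conversion to `testing::Matcher`, plus an `AnyOfJSON` factory so test call sites stay unchanged:

```cpp
// Sketch of a hand-written matcher following gmock's MatcherInterface
// pattern; arg_type is expected to be std::shared_ptr<Scalar>.
class AnyOfJSONMatcher {
 public:
  AnyOfJSONMatcher(std::shared_ptr<DataType> type, std::string array_json)
      : type_(std::move(type)), array_json_(std::move(array_json)) {}

  template <typename arg_type>
  operator testing::Matcher<arg_type>() const {  // NOLINT runtime/explicit
    struct Impl : testing::MatcherInterface<const arg_type&> {
      Impl(const std::shared_ptr<DataType>& type, const std::string& json)
          : array_(ArrayFromJSON(type, json)) {}
      void DescribeTo(std::ostream* os) const override {
        *os << "matches at least one scalar from " << array_->ToString();
      }
      void DescribeNegationTo(std::ostream* os) const override {
        *os << "matches no scalar from " << array_->ToString();
      }
      bool MatchAndExplain(const arg_type& arg,
                           testing::MatchResultListener*) const override {
        for (int64_t i = 0; i < array_->length(); ++i) {
          if (array_->GetScalar(i).ValueOrDie()->Equals(*arg)) return true;
        }
        return false;
      }
      std::shared_ptr<Array> array_;
    };
    return testing::Matcher<arg_type>(new Impl(type_, array_json_));
  }

 private:
  std::shared_ptr<DataType> type_;
  std::string array_json_;
};

inline AnyOfJSONMatcher AnyOfJSON(std::shared_ptr<DataType> type,
                                  std::string array_json) {
  return {std::move(type), std::move(array_json)};
}
```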

##########
File path: cpp/src/arrow/compute/kernels/hash_aggregate_test.cc
##########
@@ -2460,6 +2461,294 @@ TEST(GroupBy, Distinct) {
   }
 }
 
+TEST(GroupBy, OneMiscTypes) {
+  auto in_schema = schema({
+      field("floats", float64()),
+      field("nulls", null()),
+      field("booleans", boolean()),
+      field("decimal128", decimal128(3, 2)),
+      field("decimal256", decimal256(3, 2)),
+      field("fixed_binary", fixed_size_binary(3)),
+      field("key", int64()),
+  });
+  for (bool use_exec_plan : {true, false}) {
+    for (bool use_threads : {true, false}) {
+      SCOPED_TRACE(use_threads ? "parallel/merged" : "serial");
+
+      auto table = TableFromJSON(in_schema, {R"([
+    [null, null, true,   null,    null,    null,  1],
+    [1.0,  null, true,   "1.01",  "1.01",  "aaa", 1]
+])",
+                                             R"([
+    [0.0,   null, false, "0.00",  "0.00",  "bac", 2],
+    [null,  null, false, null,    null,    null,  3],
+    [4.0,   null, null,  "4.01",  "4.01",  "234", null],
+    [3.25,  null, true,  "3.25",  "3.25",  "ddd", 1],
+    [0.125, null, false, "0.12",  "0.12",  "bcd", 2]
+])",
+                                             R"([
+    [-0.25, null, false, "-0.25", "-0.25", "bab", 2],
+    [0.75,  null, true,  "0.75",  "0.75",  "123", null],
+    [null,  null, true,  null,    null,    null,  3]
+])"});
+
+      ASSERT_OK_AND_ASSIGN(Datum aggregated_and_grouped,
+                           GroupByTest(
+                               {
+                                   table->GetColumnByName("floats"),
+                                   table->GetColumnByName("nulls"),
+                                   table->GetColumnByName("booleans"),
+                                   table->GetColumnByName("decimal128"),
+                                   table->GetColumnByName("decimal256"),
+                                   table->GetColumnByName("fixed_binary"),
+                               },
+                               {table->GetColumnByName("key")},
+                               {
+                                   {"hash_one", nullptr},
+                                   {"hash_one", nullptr},
+                                   {"hash_one", nullptr},
+                                   {"hash_one", nullptr},
+                                   {"hash_one", nullptr},
+                                   {"hash_one", nullptr},
+                               },
+                               use_threads, use_exec_plan));
+      ValidateOutput(aggregated_and_grouped);
+      SortBy({"key_0"}, &aggregated_and_grouped);
+
+      const auto& struct_arr = aggregated_and_grouped.array_as<StructArray>();
+      //  Check the key column
+      AssertDatumsEqual(ArrayFromJSON(int64(), R"([1, 2, 3, null])"),
+                        struct_arr->field(struct_arr->num_fields() - 1));
+
+      //  Check values individually
+      auto col_0_type = float64();
+      const auto& col_0 = struct_arr->field(0);
+      EXPECT_THAT(col_0->GetScalar(0),
+                  ResultWith(AnyOfJSON(col_0_type, R"([1.0, 3.25])")));
+      EXPECT_THAT(col_0->GetScalar(1),
+                  ResultWith(AnyOfJSON(col_0_type, R"([0.0, 0.125, -0.25])")));
+      EXPECT_THAT(col_0->GetScalar(2), ResultWith(AnyOfJSON(col_0_type, R"([null])")));
+      EXPECT_THAT(col_0->GetScalar(3),
+                  ResultWith(AnyOfJSON(col_0_type, R"([4.0, 0.75])")));
+
+      auto col_1_type = null();
+      const auto& col_1 = struct_arr->field(1);
+      EXPECT_THAT(col_1->GetScalar(0), ResultWith(AnyOfJSON(col_1_type, R"([null])")));
+      EXPECT_THAT(col_1->GetScalar(1), ResultWith(AnyOfJSON(col_1_type, R"([null])")));
+      EXPECT_THAT(col_1->GetScalar(2), ResultWith(AnyOfJSON(col_1_type, R"([null])")));
+      EXPECT_THAT(col_1->GetScalar(3), ResultWith(AnyOfJSON(col_1_type, R"([null])")));
+
+      auto col_2_type = boolean();
+      const auto& col_2 = struct_arr->field(2);
+      EXPECT_THAT(col_2->GetScalar(0), ResultWith(AnyOfJSON(col_2_type, R"([true])")));
+      EXPECT_THAT(col_2->GetScalar(1), ResultWith(AnyOfJSON(col_2_type, R"([false])")));
+      EXPECT_THAT(col_2->GetScalar(2),
+                  ResultWith(AnyOfJSON(col_2_type, R"([true, false])")));
+      EXPECT_THAT(col_2->GetScalar(3),
+                  ResultWith(AnyOfJSON(col_2_type, R"([true, null])")));
+
+      auto col_3_type = decimal128(3, 2);
+      const auto& col_3 = struct_arr->field(3);
+      EXPECT_THAT(col_3->GetScalar(0),
+                  ResultWith(AnyOfJSON(col_3_type, R"(["1.01", "3.25"])")));
+      EXPECT_THAT(col_3->GetScalar(1),
+                  ResultWith(AnyOfJSON(col_3_type, R"(["0.00", "0.12", "-0.25"])")));
+      EXPECT_THAT(col_3->GetScalar(2), ResultWith(AnyOfJSON(col_3_type, R"([null])")));
+      EXPECT_THAT(col_3->GetScalar(3),
+                  ResultWith(AnyOfJSON(col_3_type, R"(["4.01", "0.75"])")));
+
+      auto col_4_type = decimal256(3, 2);
+      const auto& col_4 = struct_arr->field(4);
+      EXPECT_THAT(col_4->GetScalar(0),
+                  ResultWith(AnyOfJSON(col_4_type, R"(["1.01", "3.25"])")));
+      EXPECT_THAT(col_4->GetScalar(1),
+                  ResultWith(AnyOfJSON(col_4_type, R"(["0.00", "0.12", "-0.25"])")));
+      EXPECT_THAT(col_4->GetScalar(2), ResultWith(AnyOfJSON(col_4_type, R"([null])")));
+      EXPECT_THAT(col_4->GetScalar(3),
+                  ResultWith(AnyOfJSON(col_4_type, R"(["4.01", "0.75"])")));
+
+      auto col_5_type = fixed_size_binary(3);
+      const auto& col_5 = struct_arr->field(5);
+      EXPECT_THAT(col_5->GetScalar(0),
+                  ResultWith(AnyOfJSON(col_5_type, R"(["aaa", "ddd"])")));
+      EXPECT_THAT(col_5->GetScalar(1),
+                  ResultWith(AnyOfJSON(col_5_type, R"(["bab", "bcd", "bac"])")));
+      EXPECT_THAT(col_5->GetScalar(2), ResultWith(AnyOfJSON(col_5_type, R"([null])")));
+      EXPECT_THAT(col_5->GetScalar(3),
+                  ResultWith(AnyOfJSON(col_5_type, R"(["123", "234"])")));
+    }
+  }
+}
+
+TEST(GroupBy, OneNumericTypes) {
+  std::vector<std::shared_ptr<DataType>> types;
+  types.insert(types.end(), NumericTypes().begin(), NumericTypes().end());
+  types.insert(types.end(), TemporalTypes().begin(), TemporalTypes().end());
+  types.push_back(month_interval());
+
+  const std::vector<std::string> numeric_table_json = {R"([
+      [null, 1],
+      [1,    1]
+    ])",
+                                                       R"([
+      [0,    2],
+      [null, 3],
+      [3,    4],
+      [5,    4],
+      [4,    null],
+      [3,    1],
+      [0,    2]
+    ])",
+                                                       R"([
+      [0,    2],
+      [1,    null],
+      [null, 3]
+    ])"};
+
+  const std::vector<std::string> temporal_table_json = {R"([
+      [null,      1],
+      [86400000,  1]
+    ])",
+                                                        R"([
+      [0,         2],
+      [null,      3],
+      [259200000, 4],
+      [432000000, 4],
+      [345600000, null],
+      [259200000, 1],
+      [0,         2]
+    ])",
+                                                        R"([
+      [0,         2],
+      [86400000,  null],
+      [null,      3]
+    ])"};
+
+  for (const auto& type : types) {
+    for (bool use_exec_plan : {true, false}) {
+      for (bool use_threads : {true, false}) {
+        SCOPED_TRACE(type->ToString());
+        auto in_schema = schema({field("argument0", type), field("key", int64())});
+        auto table =
+            TableFromJSON(in_schema, (type->name() == "date64") ? temporal_table_json
+                                                                : numeric_table_json);
+        ASSERT_OK_AND_ASSIGN(
+            Datum aggregated_and_grouped,
+            GroupByTest({table->GetColumnByName("argument0")},
+                        {table->GetColumnByName("key")}, {{"hash_one", nullptr}},
+                        use_threads, use_exec_plan));
+        ValidateOutput(aggregated_and_grouped);
+        SortBy({"key_0"}, &aggregated_and_grouped);
+
+        const auto& struct_arr = aggregated_and_grouped.array_as<StructArray>();
+        //  Check the key column
+        AssertDatumsEqual(ArrayFromJSON(int64(), R"([1, 2, 3, 4, null])"),
+                          struct_arr->field(struct_arr->num_fields() - 1));
+
+        //  Check values individually
+        const auto& col = struct_arr->field(0);
+        if (type->name() == "date64") {
+          EXPECT_THAT(col->GetScalar(0),
+                      ResultWith(AnyOfJSON(type, R"([86400000, 259200000])")));
+          EXPECT_THAT(col->GetScalar(1), ResultWith(AnyOfJSON(type, R"([0])")));
+          EXPECT_THAT(col->GetScalar(2), ResultWith(AnyOfJSON(type, R"([null])")));
+          EXPECT_THAT(col->GetScalar(3),
+                      ResultWith(AnyOfJSON(type, R"([259200000, 432000000])")));
+          EXPECT_THAT(col->GetScalar(4),
+                      ResultWith(AnyOfJSON(type, R"([345600000, 86400000])")));
+        } else {
+          EXPECT_THAT(col->GetScalar(0), ResultWith(AnyOfJSON(type, R"([1, 3])")));
+          EXPECT_THAT(col->GetScalar(1), ResultWith(AnyOfJSON(type, R"([0])")));
+          EXPECT_THAT(col->GetScalar(2), ResultWith(AnyOfJSON(type, R"([null])")));
+          EXPECT_THAT(col->GetScalar(3), ResultWith(AnyOfJSON(type, R"([3, 5])")));
+          EXPECT_THAT(col->GetScalar(4), ResultWith(AnyOfJSON(type, R"([4, 1])")));
+        }
+      }
+    }
+  }
+}
+
+TEST(GroupBy, OneBinaryTypes) {
+  for (bool use_exec_plan : {true, false}) {
+    for (bool use_threads : {true, false}) {
+      for (const auto& type : BaseBinaryTypes()) {
+        SCOPED_TRACE(use_threads ? "parallel/merged" : "serial");
+
+        auto table = TableFromJSON(schema({
+                                       field("argument0", type),
+                                       field("key", int64()),
+                                   }),
+                                   {R"([
+    [null,   1],
+    ["aaaa", 1]
+])",
+                                    R"([
+    ["babcd",2],
+    [null,   3],
+    ["2",    null],
+    ["d",    1],
+    ["bc",   2]
+])",
+                                    R"([
+    ["bcd", 2],
+    ["123", null],
+    [null,  3]
+])"});
+
+        ASSERT_OK_AND_ASSIGN(
+            Datum aggregated_and_grouped,
+            GroupByTest({table->GetColumnByName("argument0")},
+                        {table->GetColumnByName("key")}, {{"hash_one", nullptr}},
+                        use_threads, use_exec_plan));
+        ValidateOutput(aggregated_and_grouped);
+        SortBy({"key_0"}, &aggregated_and_grouped);
+
+        const auto& struct_arr = aggregated_and_grouped.array_as<StructArray>();
+        //  Check the key column
+        AssertDatumsEqual(ArrayFromJSON(int64(), R"([1, 2, 3, null])"),
+                          struct_arr->field(struct_arr->num_fields() - 1));
+
+        const auto& col = struct_arr->field(0);
+        EXPECT_THAT(col->GetScalar(0), ResultWith(AnyOfJSON(type, R"(["aaaa", "d"])")));
+        EXPECT_THAT(col->GetScalar(1),
+                    ResultWith(AnyOfJSON(type, R"(["bcd", "bc", "babcd"])")));
+        EXPECT_THAT(col->GetScalar(2), ResultWith(AnyOfJSON(type, R"([null])")));
+        EXPECT_THAT(col->GetScalar(3), ResultWith(AnyOfJSON(type, R"(["2", "123"])")));
+      }
+    }
+  }
+}
+
+TEST(GroupBy, OneScalar) {

Review comment:
       Is this the correct way to write this?







[GitHub] [arrow] lidavidm commented on pull request #12368: ARROW-13993: [C++] [Compute] Add hash_one aggregate function

Posted by GitBox <gi...@apache.org>.
lidavidm commented on pull request #12368:
URL: https://github.com/apache/arrow/pull/12368#issuecomment-1042351840


   For MakeOffsetsValues: another JIRA should suffice if you want to do that later





[GitHub] [arrow] lidavidm commented on a change in pull request #12368: ARROW-13993: [C++] [Compute] Add hash_one aggregate function

Posted by GitBox <gi...@apache.org>.
lidavidm commented on a change in pull request #12368:
URL: https://github.com/apache/arrow/pull/12368#discussion_r807991560



##########
File path: cpp/src/arrow/compute/kernels/hash_aggregate_test.cc
##########
@@ -2460,6 +2461,315 @@ TEST(GroupBy, Distinct) {
   }
 }
 
+MATCHER_P2(AnyOfJSON, type, array_json, "") {
+  auto array = ArrayFromJSON(type, array_json);
+  for (int64_t i = 0; i < array->length(); ++i) {
+    std::shared_ptr<Scalar> scalar;
+    auto maybe_scalar = array->GetScalar(i);
+    if (maybe_scalar.ok()) {
+      scalar = maybe_scalar.ValueOrDie();
+    } else {
+      *result_listener << "Unable to retrieve scalar via GetScalar() "
+                       << "at index " << i << " from the input JSON Array";
+      return false;
+    }
+
+    if (scalar->Equals(arg)) return true;
+  }
+  *result_listener << "Argument scalar: '" << arg->ToString()
+                   << "' matches no input scalar.";

Review comment:
       also include `array_json`?

##########
File path: cpp/src/arrow/compute/kernels/hash_aggregate_test.cc
##########
@@ -2460,6 +2461,315 @@ TEST(GroupBy, Distinct) {
   }
 }
 
+MATCHER_P2(AnyOfJSON, type, array_json, "") {
+  auto array = ArrayFromJSON(type, array_json);
+  for (int64_t i = 0; i < array->length(); ++i) {
+    std::shared_ptr<Scalar> scalar;
+    auto maybe_scalar = array->GetScalar(i);
+    if (maybe_scalar.ok()) {
+      scalar = maybe_scalar.ValueOrDie();
+    } else {
+      *result_listener << "Unable to retrieve scalar via GetScalar() "
+                       << "at index " << i << " from the input JSON Array";

Review comment:
       also include `maybe_scalar.status().ToString()`?

##########
File path: cpp/src/arrow/compute/kernels/hash_aggregate_test.cc
##########
@@ -2460,6 +2461,315 @@ TEST(GroupBy, Distinct) {
   }
 }
 
+MATCHER_P2(AnyOfJSON, type, array_json, "") {

Review comment:
       Just reiterating my earlier comment, but we've found before that these convenience macros don't work in all environments.







[GitHub] [arrow] ursabot edited a comment on pull request #12368: ARROW-13993: [C++] [Compute] Add hash_one aggregate function

Posted by GitBox <gi...@apache.org>.
ursabot edited a comment on pull request #12368:
URL: https://github.com/apache/arrow/pull/12368#issuecomment-1042943111


   Benchmark runs are scheduled for baseline = ed25c616c6270142cce0a2a36c7474e28e167184 and contender = 74f512260fa69903feac61e1287f6954a3d98204. 74f512260fa69903feac61e1287f6954a3d98204 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
   Conbench compare runs links:
   [Finished :arrow_down:0.0% :arrow_up:0.0%] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/a94a3f0000b84a85a4d19562ff559187...7b33e4d8929944749d78fd64bf5054fa/)
   [Scheduled] [test-mac-arm](https://conbench.ursa.dev/compare/runs/15b82b3edbb4496baefc9bb7f57453c9...34fcb6b914b4436fae5816b099686be4/)
   [Scheduled] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/11c0b23a67e6433c9358d07ced046614...2f00fea949d543acae398e06ffb8787f/)
   [Finished :arrow_down:0.21% :arrow_up:0.09%] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/31368249ed3448b5b3ec5a0396533d0b...ce98c7d67ad84386b4028c02679b0d02/)
   Supported benchmarks:
   ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python. Runs only benchmarks with cloud = True
   test-mac-arm: Supported benchmark langs: C++, Python, R
   ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
   ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
   





[GitHub] [arrow] dhruv9vats commented on a change in pull request #12368: ARROW-13993: [C++] [Compute] Add hash_one aggregate function

Posted by GitBox <gi...@apache.org>.
dhruv9vats commented on a change in pull request #12368:
URL: https://github.com/apache/arrow/pull/12368#discussion_r805965402



##########
File path: cpp/src/arrow/compute/kernels/hash_aggregate_test.cc
##########
@@ -2460,6 +2461,558 @@ TEST(GroupBy, Distinct) {
   }
 }
 
+MATCHER_P(AnyOfScalar, arrow_array, "") {
+  for (int64_t i = 0; i < arrow_array->length(); ++i) {
+    auto scalar = arrow_array->GetScalar(i).ValueOrDie();
+    if (scalar->Equals(arg)) return true;
+  }
+  *result_listener << "Argument scalar: '" << arg->ToString()
+                   << "' matches no input scalar.";
+  return false;
+}
+
+MATCHER_P(AnyOfScalarFromUniques, unique_list, "") {
+  const auto& flatten = unique_list->Flatten().ValueOrDie();
+  const auto& offsets = std::dynamic_pointer_cast<Int32Array>(unique_list->offsets());
+
+  for (int64_t i = 0; i < arg->length(); ++i) {
+    bool match_found = false;
+    const auto group_hash_one = arg->GetScalar(i).ValueOrDie();
+    int64_t start = offsets->Value(i);
+    int64_t end = offsets->Value(i + 1);
+    for (int64_t j = start; j < end; ++j) {
+      auto s = flatten->GetScalar(j).ValueOrDie();
+      if (s->Equals(group_hash_one)) {
+        match_found = true;
+        break;
+      }
+    }
+    if (!match_found) {
+      *result_listener << "Argument scalar: '" << group_hash_one->ToString()
+                       << "' matches no input scalar.";
+      return false;
+    }
+  }
+  return true;
+}
+
+TEST(GroupBy, One) {
+  {
+    auto table =
+        TableFromJSON(schema({field("argument", int64()), field("key", int64())}), {R"([
+    [99,  1],
+    [99,  1]
+])",
+                                                                                    R"([
+    [77,  2],
+    [null,   3],
+    [null,   3]
+])",
+                                                                                    R"([
+    [null,   4],
+    [null,   4]
+])",
+                                                                                  R"([
+    [88,  null],
+    [99,  3]
+])",
+                                                                                  R"([
+    [77,  2],
+    [76, 2]
+])",
+                                                                                  R"([
+    [75, null],
+    [74,  3]
+  ])",
+                                                                                  R"([
+    [73,    null],
+    [72,    null]
+  ])"});
+
+  ASSERT_OK_AND_ASSIGN(auto aggregated_and_grouped,
+                       internal::GroupBy(
+                           {
+                               table->GetColumnByName("argument"),
+                           },
+                           {
+                               table->GetColumnByName("key"),
+                           },
+                           {
+                               {"hash_one", nullptr},
+                           },
+                           false));
+  ValidateOutput(aggregated_and_grouped);
+  SortBy({"key_0"}, &aggregated_and_grouped);
+
+  AssertDatumsEqual(ArrayFromJSON(struct_({
+                                      field("hash_one", int64()),
+                                      field("key_0", int64()),
+                                  }),
+                                  R"([
+      [99, 1],
+      [77, 2],
+      [null,  3],
+      [null,  4],
+      [88, null]
+    ])"),
+                    aggregated_and_grouped,
+                    /*verbose=*/true);
+  }
+  {
+    auto table =
+        TableFromJSON(schema({field("argument", utf8()), field("key", int64())}), {R"([
+     ["foo",  1],
+     ["foo",  1]
+ ])",
+                                                                                   R"([
+     ["bar",  2],
+     [null,   3],
+     [null,   3]
+ ])",
+                                                                                   R"([
+     [null,   4],
+     [null,   4]
+ ])",
+                                                                                   R"([
+     ["baz",  null],
+     ["foo",  3]
+ ])",
+                                                                                   R"([
+     ["bar",  2],
+     ["spam", 2]
+ ])",
+                                                                                   R"([
+     ["eggs", null],
+     ["ham",  3]
+   ])",
+                                                                                   R"([
+     ["a",    null],
+     ["b",    null]
+   ])"});
+
+    ASSERT_OK_AND_ASSIGN(auto aggregated_and_grouped,
+                         internal::GroupBy(
+                             {
+                                 table->GetColumnByName("argument"),
+                             },
+                             {
+                                 table->GetColumnByName("key"),
+                             },
+                             {
+                                 {"hash_one", nullptr},
+                             },
+                             false));
+    ValidateOutput(aggregated_and_grouped);
+    SortBy({"key_0"}, &aggregated_and_grouped);
+
+    AssertDatumsEqual(ArrayFromJSON(struct_({
+                                        field("hash_one", utf8()),
+                                        field("key_0", int64()),
+                                    }),
+                                    R"([
+       ["foo", 1],
+       ["bar", 2],
+       [null,  3],
+       [null,  4],
+       ["baz", null]
+     ])"),
+                      aggregated_and_grouped,
+                      /*verbose=*/true);
+  }
+}
+
+TEST(GroupBy, OneOnly) {
+  auto in_schema = schema({
+      field("argument0", float64()),
+      field("argument1", null()),
+      field("argument2", boolean()),
+      field("key", int64()),
+  });
+  for (bool use_exec_plan : {false, true}) {
+    for (bool use_threads : {false, true}) {
+      SCOPED_TRACE(use_threads ? "parallel/merged" : "serial");
+
+      auto table = TableFromJSON(in_schema, {R"([
+    [1.0,   null, true, 1],
+    [null,  null, true, 1]
+])",
+                                             R"([
+    [0.0,   null, false, 2],
+    [null,  null, false, 3],
+    [4.0,   null, null,  null],
+    [3.25,  null, true,  1],
+    [0.125, null, false, 2]
+])",
+                                             R"([
+    [-0.25, null, false, 2],
+    [0.75,  null, true,  null],
+    [null,  null, true,  3]
+])"});
+
+      ASSERT_OK_AND_ASSIGN(Datum aggregated_and_grouped,
+                           GroupByTest(
+                               {
+                                   table->GetColumnByName("argument0"),
+                                   table->GetColumnByName("argument1"),
+                                   table->GetColumnByName("argument2"),
+                               },
+                               {table->GetColumnByName("key")},
+                               {
+                                   {"hash_one", nullptr},
+                                   {"hash_one", nullptr},
+                                   {"hash_one", nullptr},
+                               },
+                               use_threads, use_exec_plan));
+      ValidateOutput(aggregated_and_grouped);
+      SortBy({"key_0"}, &aggregated_and_grouped);
+
+      //      AssertDatumsEqual(ArrayFromJSON(struct_({
+      //                                          field("hash_one", float64()),
+      //                                          field("hash_one", null()),
+      //                                          field("hash_one", boolean()),
+      //                                          field("key_0", int64()),
+      //                                      }),
+      //                                      R"([
+      //          [1.0,  null, true,  1],
+      //          [0.0,  null, false, 2],
+      //          [null, null, false, 3],
+      //          [4.0,  null, null,  null]
+      //        ])"),
+      //                        aggregated_and_grouped,
+      //                        /*verbose=*/true);
+
+      const auto& struct_arr = aggregated_and_grouped.array_as<StructArray>();
+      //  Check the key column
+      AssertDatumsEqual(ArrayFromJSON(int64(), "[1, 2, 3, null]"), struct_arr->field(3));
+
+      auto type_col_0 = float64();
+      auto group_one_col_0 =
+          AnyOfScalar(ArrayFromJSON(type_col_0, R"([1.0, null, 3.25])"));
+      auto group_two_col_0 =
+          AnyOfScalar(ArrayFromJSON(type_col_0, R"([0.0, 0.125, -0.25])"));
+      auto group_three_col_0 = AnyOfScalar(ArrayFromJSON(type_col_0, R"([null])"));
+      auto group_null_col_0 = AnyOfScalar(ArrayFromJSON(type_col_0, R"([4.0, 0.75])"));
+
+      //  Check values individually
+      const auto& col0 = struct_arr->field(0);
+      ASSERT_OK_AND_ASSIGN(const auto g_one, col0->GetScalar(0));
+      EXPECT_THAT(g_one, group_one_col_0);
+      ASSERT_OK_AND_ASSIGN(const auto g_two, col0->GetScalar(1));
+      EXPECT_THAT(g_two, group_two_col_0);
+      ASSERT_OK_AND_ASSIGN(const auto g_three, col0->GetScalar(2));
+      EXPECT_THAT(g_three, group_three_col_0);
+      ASSERT_OK_AND_ASSIGN(const auto g_null, col0->GetScalar(3));
+      EXPECT_THAT(g_null, group_null_col_0);
+
+      CountOptions all(CountOptions::ALL);
+      ASSERT_OK_AND_ASSIGN(
+          auto distinct_out,
+          internal::GroupBy(
+              {
+                  table->GetColumnByName("argument0"),
+                  table->GetColumnByName("argument1"),
+                  table->GetColumnByName("argument2"),
+              },
+              {
+                  table->GetColumnByName("key"),
+              },
+              {{"hash_distinct", &all}, {"hash_distinct", &all}, {"hash_distinct", &all}},
+              use_threads));
+      ValidateOutput(distinct_out);
+      SortBy({"key_0"}, &distinct_out);
+
+      const auto& struct_arr_distinct = distinct_out.array_as<StructArray>();
+      for (int64_t col = 0; col < struct_arr_distinct->length() - 1; ++col) {
+        const auto matcher = AnyOfScalarFromUniques(
+            checked_pointer_cast<ListArray>(struct_arr_distinct->field(col)));
+        EXPECT_THAT(struct_arr->field(col), matcher);
+      }

Review comment:
       This tests _all_ the columns (the key column is not a `ListArray`, so it will have to be tested manually, but that's trivial). So if using other kernels to write tests is not strictly discouraged, this is a rather clean way of doing it. @lidavidm 







[GitHub] [arrow] lidavidm commented on pull request #12368: ARROW-13993: [C++] [Compute] Add hash_one aggregate function

Posted by GitBox <gi...@apache.org>.
lidavidm commented on pull request #12368:
URL: https://github.com/apache/arrow/pull/12368#issuecomment-1034907613


   We don't need to support ScalarAggregateOptions here.
   
   If it's more manageable, you can split the support for variable-width types into a separate JIRA.





[GitHub] [arrow] Crystrix commented on a change in pull request #12368: ARROW-13993: [C++] [Compute] Add hash_one aggregate function

Posted by GitBox <gi...@apache.org>.
Crystrix commented on a change in pull request #12368:
URL: https://github.com/apache/arrow/pull/12368#discussion_r802311399



##########
File path: cpp/src/arrow/compute/kernels/hash_aggregate.cc
##########
@@ -2451,6 +2451,92 @@ Result<std::unique_ptr<KernelState>> GroupedDistinctInit(KernelContext* ctx,
   return std::move(impl);
 }
 
+// ----------------------------------------------------------------------
+// One implementation
+
+struct GroupedOneImpl : public GroupedAggregator {
+  Status Init(ExecContext* ctx, const std::vector<ValueDescr>&,
+              const FunctionOptions* options) override {
+    ctx_ = ctx;
+    pool_ = ctx->memory_pool();
+    return Status::OK();
+  }
+
+  Status Resize(int64_t new_num_groups) override {
+    num_groups_ = new_num_groups;
+    return Status::OK();
+  }
+
+  Status Consume(const ExecBatch& batch) override {
+    ARROW_ASSIGN_OR_RAISE(std::ignore, grouper_->Consume(batch));
+    return Status::OK();
+  }
+
+  Status Merge(GroupedAggregator&& raw_other,
+               const ArrayData& group_id_mapping) override {
+    auto other = checked_cast<GroupedOneImpl*>(&raw_other);
+
+    // Get (value, group_id) pairs, then translate the group IDs and consume them
+    // ourselves
+    ARROW_ASSIGN_OR_RAISE(auto uniques, other->grouper_->GetUniques());
+    ARROW_ASSIGN_OR_RAISE(auto remapped_g,
+                          AllocateBuffer(uniques.length * sizeof(uint32_t), pool_));
+
+    const auto* g_mapping = group_id_mapping.GetValues<uint32_t>(1);
+    const auto* other_g = uniques[1].array()->GetValues<uint32_t>(1);
+    auto* g = reinterpret_cast<uint32_t*>(remapped_g->mutable_data());
+
+    for (int64_t i = 0; i < uniques.length; i++) {
+      g[i] = g_mapping[other_g[i]];
+    }
+    uniques.values[1] =
+        ArrayData::Make(uint32(), uniques.length, {nullptr, std::move(remapped_g)});
+
+    return Consume(std::move(uniques));
+  }
+
+  Result<Datum> Finalize() override {

Review comment:
       I think the extra `grouper_` variable from `GroupedDistinctImpl` is unnecessary, as `hash_one` doesn't need to compute distinct values. The `hash_one` struct can instead borrow from `GroupedMinMaxImpl`; in a way, `min/max` is a special case of `hash_one`.
   
   Like the `mins_` variable in `GroupedMinMaxImpl`, which stores the min value of each group, we can have a similar variable to store the values. The remaining operations should then mirror `GroupedMinMaxImpl`, just without the value comparison (see the sketch after this list):
   
   - `Consume`: store the value for a group if no value exists for it yet.
   - `Merge`: add the values of new groups.
   - `Finalize`: output the values and groups, the same as `GroupedMinMaxImpl`.
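   
   To make that shape concrete, here is a minimal sketch of the suggestion — not the PR's actual code; `GroupedOneSketch`, `ConsumeValue`, and the `double`-valued storage are hypothetical simplifications:
   ```cpp
   #include <cstdint>
   #include <vector>
   
   // Sketch: keep the first value seen per group, mirroring how
   // GroupedMinMaxImpl keeps mins_/maxes_, minus the comparison.
   struct GroupedOneSketch {
     std::vector<double> ones_;     // one stored value per group (double for illustration)
     std::vector<bool> has_value_;  // whether a group already has a value
   
     void Resize(int64_t new_num_groups) {
       ones_.resize(new_num_groups, 0.0);
       has_value_.resize(new_num_groups, false);
     }
   
     // Consume: the first non-null value seen for a group wins.
     void ConsumeValue(uint32_t g, double val) {
       if (!has_value_[g]) {
         ones_[g] = val;
         has_value_[g] = true;
       }
     }
   
     // Merge: adopt the other state's value for groups not yet filled,
     // translating the other state's group ids through group_id_mapping.
     void Merge(const GroupedOneSketch& other,
                const std::vector<uint32_t>& group_id_mapping) {
       for (size_t other_g = 0; other_g < group_id_mapping.size(); ++other_g) {
         uint32_t g = group_id_mapping[other_g];
         if (!has_value_[g] && other.has_value_[other_g]) {
           ones_[g] = other.ones_[other_g];
           has_value_[g] = true;
         }
       }
     }
     // Finalize would emit ones_ with has_value_ as the validity bitmap,
     // just as GroupedMinMaxImpl does.
   };
   ```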
   







[GitHub] [arrow] lidavidm commented on a change in pull request #12368: ARROW-13993: [C++] [Compute] Add hash_one aggregate function

Posted by GitBox <gi...@apache.org>.
lidavidm commented on a change in pull request #12368:
URL: https://github.com/apache/arrow/pull/12368#discussion_r804677914



##########
File path: cpp/src/arrow/compute/kernels/hash_aggregate_test.cc
##########
@@ -2460,6 +2460,476 @@ TEST(GroupBy, Distinct) {
   }
 }
 
+TEST(GroupBy, One) {
+  {
+    auto table =
+        TableFromJSON(schema({field("argument", int64()), field("key", int64())}), {R"([
+    [99,  1],
+    [99,  1]
+])",
+                                                                                    R"([
+    [77,  2],
+    [null,   3],
+    [null,   3]
+])",
+                                                                                    R"([
+    [null,   4],
+    [null,   4]
+])",
+                                                                                  R"([
+    [88,  null],
+    [99,  3]
+])",
+                                                                                  R"([
+    [77,  2],
+    [76, 2]
+])",
+                                                                                  R"([
+    [75, null],
+    [74,  3]
+  ])",
+                                                                                  R"([
+    [73,    null],
+    [72,    null]
+  ])"});
+
+  ASSERT_OK_AND_ASSIGN(auto aggregated_and_grouped,
+                       internal::GroupBy(
+                           {
+                               table->GetColumnByName("argument"),
+                           },
+                           {
+                               table->GetColumnByName("key"),
+                           },
+                           {
+                               {"hash_one", nullptr},
+                           },
+                           false));
+  ValidateOutput(aggregated_and_grouped);
+  SortBy({"key_0"}, &aggregated_and_grouped);
+
+  AssertDatumsEqual(ArrayFromJSON(struct_({
+                                      field("hash_one", int64()),
+                                      field("key_0", int64()),
+                                  }),
+                                  R"([
+      [99, 1],
+      [77, 2],
+      [null,  3],
+      [null,  4],
+      [88, null]
+    ])"),
+                    aggregated_and_grouped,
+                    /*verbose=*/true);
+  }
+  {
+    auto table =
+        TableFromJSON(schema({field("argument", utf8()), field("key", int64())}), {R"([
+     ["foo",  1],
+     ["foo",  1]
+ ])",
+                                                                                   R"([
+     ["bar",  2],
+     [null,   3],
+     [null,   3]
+ ])",
+                                                                                   R"([
+     [null,   4],
+     [null,   4]
+ ])",
+                                                                                   R"([
+     ["baz",  null],
+     ["foo",  3]
+ ])",
+                                                                                   R"([
+     ["bar",  2],
+     ["spam", 2]
+ ])",
+                                                                                   R"([
+     ["eggs", null],
+     ["ham",  3]
+   ])",
+                                                                                   R"([
+     ["a",    null],
+     ["b",    null]
+   ])"});
+
+    ASSERT_OK_AND_ASSIGN(auto aggregated_and_grouped,
+                         internal::GroupBy(
+                             {
+                                 table->GetColumnByName("argument"),
+                             },
+                             {
+                                 table->GetColumnByName("key"),
+                             },
+                             {
+                                 {"hash_one", nullptr},
+                             },
+                             false));
+    ValidateOutput(aggregated_and_grouped);
+    SortBy({"key_0"}, &aggregated_and_grouped);
+
+    AssertDatumsEqual(ArrayFromJSON(struct_({
+                                        field("hash_one", utf8()),
+                                        field("key_0", int64()),
+                                    }),
+                                    R"([
+       ["foo", 1],
+       ["bar", 2],
+       [null,  3],
+       [null,  4],
+       ["baz", null]
+     ])"),
+                      aggregated_and_grouped,
+                      /*verbose=*/true);
+  }
+}
+
+TEST(GroupBy, OneOnly) {
+  auto in_schema = schema({
+      field("argument0", float64()),
+      field("argument1", null()),
+      field("argument2", boolean()),
+      field("key", int64()),
+  });
+  for (bool use_exec_plan : {false, true}) {
+    for (bool use_threads : {false}) {
+      SCOPED_TRACE(use_threads ? "parallel/merged" : "serial");
+
+      auto table = TableFromJSON(in_schema, {R"([
+    [1.0,   null, true, 1],
+    [null,  null, true, 1]
+])",
+                                             R"([
+    [0.0,   null, false, 2],
+    [null,  null, false, 3],
+    [4.0,   null, null,  null],
+    [3.25,  null, true,  1],
+    [0.125, null, false, 2]
+])",
+                                             R"([
+    [-0.25, null, false, 2],
+    [0.75,  null, true,  null],
+    [null,  null, true,  3]
+])"});
+
+      ASSERT_OK_AND_ASSIGN(Datum aggregated_and_grouped,
+                           GroupByTest(
+                               {
+                                   table->GetColumnByName("argument0"),
+                                   table->GetColumnByName("argument1"),
+                                   table->GetColumnByName("argument2"),
+                               },
+                               {table->GetColumnByName("key")},
+                               {
+                                   {"hash_one", nullptr},
+                                   {"hash_one", nullptr},
+                                   {"hash_one", nullptr},
+                               },
+                               use_threads, use_exec_plan));
+      ValidateOutput(aggregated_and_grouped);
+      SortBy({"key_0"}, &aggregated_and_grouped);
+
+      AssertDatumsEqual(ArrayFromJSON(struct_({
+                                          field("hash_one", float64()),
+                                          field("hash_one", null()),
+                                          field("hash_one", boolean()),
+                                          field("key_0", int64()),
+                                      }),
+                                      R"([
+    [1.0,  null, true,  1],
+    [0.0,  null, false, 2],
+    [null, null, false, 3],
+    [4.0,  null, null,  null]
+  ])"),
+                        aggregated_and_grouped,
+                        /*verbose=*/true);
+    }
+  }
+}
+
+TEST(GroupBy, OneTypes) {
+  std::vector<std::shared_ptr<DataType>> types;
+  types.insert(types.end(), NumericTypes().begin(), NumericTypes().end());
+  types.insert(types.end(), TemporalTypes().begin(), TemporalTypes().end());
+  types.push_back(month_interval());
+
+  const std::vector<std::string> default_table = {R"([
+    [1,    1],
+    [null, 1]
+])",
+                                                  R"([
+    [0,    2],
+    [null, 3],
+    [3,    4],
+    [5,    4],
+    [4,    null],
+    [3,    1],
+    [0,    2]
+])",
+                                                  R"([
+    [0,    2],
+    [1,    null],
+    [null, 3]
+])"};
+
+  const std::vector<std::string> date64_table = {R"([
+    [86400000,  1],
+    [null,      1]
+])",
+                                                 R"([
+    [0,         2],
+    [null,      3],
+    [259200000, 4],
+    [432000000, 4],
+    [345600000, null],
+    [259200000, 1],
+    [0,         2]
+])",
+                                                 R"([
+    [0,         2],
+    [86400000,  null],
+    [null,      3]
+])"};
+
+  const std::string default_expected =
+      R"([
+    [1,    1],
+    [0,    2],
+    [null, 3],
+    [3,    4],
+    [4,    null]
+    ])";
+
+  const std::string date64_expected =
+      R"([
+    [86400000,  1],
+    [0,         2],
+    [null,      3],
+    [259200000, 4],
+    [345600000, null]
+    ])";
+
+  for (const auto& ty : types) {
+    SCOPED_TRACE(ty->ToString());
+    auto in_schema = schema({field("argument0", ty), field("key", int64())});
+    auto table =
+        TableFromJSON(in_schema, (ty->name() == "date64") ? date64_table : default_table);
+
+    ASSERT_OK_AND_ASSIGN(
+        Datum aggregated_and_grouped,
+        GroupByTest({table->GetColumnByName("argument0")},
+                    {table->GetColumnByName("key")}, {{"hash_one", nullptr}},
+                    /*use_threads=*/false, /*use_exec_plan=*/true));
+    ValidateOutput(aggregated_and_grouped);
+    SortBy({"key_0"}, &aggregated_and_grouped);
+
+    AssertDatumsEqual(
+        ArrayFromJSON(struct_({
+                          field("hash_one", ty),
+                          field("key_0", int64()),
+                      }),
+                      (ty->name() == "date64") ? date64_expected : default_expected),
+        aggregated_and_grouped,
+        /*verbose=*/true);
+  }
+}
+
+TEST(GroupBy, OneDecimal) {
+  auto in_schema = schema({
+      field("argument0", decimal128(3, 2)),
+      field("argument1", decimal256(3, 2)),
+      field("key", int64()),
+  });
+  for (bool use_exec_plan : {false, true}) {
+    for (bool use_threads : {/*true, */ false}) {

Review comment:
       Hmm, we can't assume any particular value, and we shouldn't assume anything about the order in which the kernel gets batches, whether it's serial or parallel. That makes the test hard to write, though.
   
   We could do something like this:
   ```cpp
   const auto& struct_arr = aggregated_and_grouped.array_as<StructArray>();
   // Check the key column
   AssertDatumsEqual(ArrayFromJSON(int64(), "[1, 2, 3, 4, null]"), struct_arr->field(2));
   // Check values individually
   const auto& col0 = struct_arr->field(0);
   ASSERT_OK_AND_ASSIGN(const auto col0_0, col0->GetScalar(0));
   EXPECT_TRUE(col0_0->Equals(*ScalarFromJSON(...)) || col0_0->Equals(*ScalarFromJSON(...)) ...);
   ```
   
   This would get tedious fast, though, so I would honestly just reduce the number of groups in the first place.







[GitHub] [arrow] lidavidm commented on a change in pull request #12368: ARROW-13993: [C++] [Compute] Add hash_one aggregate function

Posted by GitBox <gi...@apache.org>.
lidavidm commented on a change in pull request #12368:
URL: https://github.com/apache/arrow/pull/12368#discussion_r806838598



##########
File path: cpp/src/arrow/compute/kernels/hash_aggregate_test.cc
##########
@@ -2460,6 +2461,558 @@ TEST(GroupBy, Distinct) {
   }
 }
 
+MATCHER_P(AnyOfScalar, arrow_array, "") {

Review comment:
       IIRC, I think these convenience macros aren't always available in the CI environments we use. See https://github.com/apache/arrow/commit/cd30dea861d6dfd670032c655f329cb16bb99a7a

##########
File path: cpp/src/arrow/compute/kernels/hash_aggregate.cc
##########
@@ -2451,6 +2451,333 @@ Result<std::unique_ptr<KernelState>> GroupedDistinctInit(KernelContext* ctx,
   return std::move(impl);
 }
 
+// ----------------------------------------------------------------------
+// One implementation
+
+template <typename Type, typename Enable = void>
+struct GroupedOneImpl final : public GroupedAggregator {
+  using CType = typename TypeTraits<Type>::CType;
+  using GetSet = GroupedValueTraits<Type>;
+
+  Status Init(ExecContext* ctx, const std::vector<ValueDescr>&,
+              const FunctionOptions* options) override {
+    // out_type_ initialized by GroupedOneInit
+    ones_ = TypedBufferBuilder<CType>(ctx->memory_pool());
+    has_one_ = TypedBufferBuilder<bool>(ctx->memory_pool());
+    has_value_ = TypedBufferBuilder<bool>(ctx->memory_pool());
+    return Status::OK();
+  }
+
+  Status Resize(int64_t new_num_groups) override {
+    auto added_groups = new_num_groups - num_groups_;
+    num_groups_ = new_num_groups;
+    RETURN_NOT_OK(ones_.Append(added_groups, static_cast<CType>(0)));
+    RETURN_NOT_OK(has_one_.Append(added_groups, false));
+    RETURN_NOT_OK(has_value_.Append(added_groups, false));
+    return Status::OK();
+  }
+
+  Status Consume(const ExecBatch& batch) override {
+    auto raw_ones_ = ones_.mutable_data();
+
+    return VisitGroupedValues<Type>(
+        batch,
+        [&](uint32_t g, CType val) -> Status {
+          if (!bit_util::GetBit(has_one_.data(), g)) {
+            GetSet::Set(raw_ones_, g, val);
+            bit_util::SetBit(has_one_.mutable_data(), g);
+            bit_util::SetBit(has_value_.mutable_data(), g);
+          }
+          return Status::OK();
+        },
+        [&](uint32_t g) -> Status {
+          bit_util::SetBit(has_one_.mutable_data(), g);
+          return Status::OK();
+        });
+  }
+
+  Status Merge(GroupedAggregator&& raw_other,
+               const ArrayData& group_id_mapping) override {
+    auto other = checked_cast<GroupedOneImpl*>(&raw_other);
+
+    auto raw_ones = ones_.mutable_data();
+    auto other_raw_ones = other->ones_.mutable_data();
+
+    auto g = group_id_mapping.GetValues<uint32_t>(1);
+    for (uint32_t other_g = 0; static_cast<int64_t>(other_g) < group_id_mapping.length;
+         ++other_g, ++g) {
+      if (!bit_util::GetBit(has_one_.data(), *g)) {
+        if (bit_util::GetBit(other->has_value_.data(), other_g)) {
+          GetSet::Set(raw_ones, *g, GetSet::Get(other_raw_ones, other_g));
+          bit_util::SetBit(has_value_.mutable_data(), *g);
+        }
+        bit_util::SetBit(has_one_.mutable_data(), *g);
+      }
+    }
+
+    return Status::OK();
+  }
+
+  Result<Datum> Finalize() override {
+    ARROW_ASSIGN_OR_RAISE(auto null_bitmap, has_value_.Finish());
+    ARROW_ASSIGN_OR_RAISE(auto data, ones_.Finish());
+    return ArrayData::Make(out_type_, num_groups_,
+                           {std::move(null_bitmap), std::move(data)});
+  }
+
+  std::shared_ptr<DataType> out_type() const override { return out_type_; }
+
+  int64_t num_groups_;
+  TypedBufferBuilder<CType> ones_;
+  TypedBufferBuilder<bool> has_one_, has_value_;
+  std::shared_ptr<DataType> out_type_;
+};
+
+struct GroupedNullOneImpl : public GroupedAggregator {
+  Status Init(ExecContext* ctx, const std::vector<ValueDescr>&,
+              const FunctionOptions* options) override {
+    return Status::OK();
+  }
+
+  Status Resize(int64_t new_num_groups) override {
+    num_groups_ = new_num_groups;
+    return Status::OK();
+  }
+
+  Status Consume(const ExecBatch& batch) override { return Status::OK(); }
+
+  Status Merge(GroupedAggregator&& raw_other,
+               const ArrayData& group_id_mapping) override {
+    return Status::OK();
+  }
+
+  Result<Datum> Finalize() override {
+    return ArrayData::Make(null(), num_groups_, {nullptr}, num_groups_);
+  }
+
+  std::shared_ptr<DataType> out_type() const override { return null(); }
+
+  int64_t num_groups_;
+};
+
+template <typename Type>
+struct GroupedOneImpl<Type, enable_if_t<is_base_binary_type<Type>::value ||
+                                        std::is_same<Type, FixedSizeBinaryType>::value>>
+    final : public GroupedAggregator {
+  using Allocator = arrow::stl::allocator<char>;
+  using StringType = std::basic_string<char, std::char_traits<char>, Allocator>;
+
+  Status Init(ExecContext* ctx, const std::vector<ValueDescr>&,
+              const FunctionOptions* options) override {
+    ctx_ = ctx;
+    allocator_ = Allocator(ctx->memory_pool());
+    // out_type_ initialized by GroupedOneInit
+    has_value_ = TypedBufferBuilder<bool>(ctx->memory_pool());
+    return Status::OK();
+  }
+
+  Status Resize(int64_t new_num_groups) override {
+    auto added_groups = new_num_groups - num_groups_;
+    DCHECK_GE(added_groups, 0);
+    num_groups_ = new_num_groups;
+    ones_.resize(new_num_groups);
+    RETURN_NOT_OK(has_one_.Append(added_groups, false));
+    RETURN_NOT_OK(has_value_.Append(added_groups, false));
+    return Status::OK();
+  }
+
+  Status Consume(const ExecBatch& batch) override {
+    return VisitGroupedValues<Type>(
+        batch,
+        [&](uint32_t g, util::string_view val) -> Status {
+          if (!bit_util::GetBit(has_one_.data(), g)) {
+            ones_[g].emplace(val.data(), val.size(), allocator_);
+            bit_util::SetBit(has_one_.mutable_data(), g);
+            bit_util::SetBit(has_value_.mutable_data(), g);
+          }
+          return Status::OK();
+        },
+        [&](uint32_t g) -> Status {
+          // as has_one_ is set, has_value_ will never be set, resulting in null
+          bit_util::SetBit(has_one_.mutable_data(), g);
+          return Status::OK();
+        });
+  }
+
+  Status Merge(GroupedAggregator&& raw_other,
+               const ArrayData& group_id_mapping) override {
+    auto other = checked_cast<GroupedOneImpl*>(&raw_other);
+    auto g = group_id_mapping.GetValues<uint32_t>(1);
+    for (uint32_t other_g = 0; static_cast<int64_t>(other_g) < group_id_mapping.length;
+         ++other_g, ++g) {
+      if (!bit_util::GetBit(has_one_.data(), *g)) {
+        if (bit_util::GetBit(other->has_value_.data(), other_g)) {
+          ones_[*g] = std::move(other->ones_[other_g]);
+          bit_util::SetBit(has_value_.mutable_data(), *g);
+        }
+        bit_util::SetBit(has_one_.mutable_data(), *g);
+      }
+    }
+    return Status::OK();
+  }
+
+  Result<Datum> Finalize() override {
+    ARROW_ASSIGN_OR_RAISE(auto null_bitmap, has_value_.Finish());
+    auto ones =
+        ArrayData::Make(out_type(), num_groups_, {std::move(null_bitmap), nullptr});
+    RETURN_NOT_OK(MakeOffsetsValues(ones.get(), ones_));
+    return ones;
+  }
+
+  template <typename T = Type>
+  enable_if_base_binary<T, Status> MakeOffsetsValues(
+      ArrayData* array, const std::vector<util::optional<StringType>>& values) {
+    using offset_type = typename T::offset_type;
+    ARROW_ASSIGN_OR_RAISE(
+        auto raw_offsets,
+        AllocateBuffer((1 + values.size()) * sizeof(offset_type), ctx_->memory_pool()));
+    auto* offsets = reinterpret_cast<offset_type*>(raw_offsets->mutable_data());
+    offsets[0] = 0;
+    offsets++;
+    const uint8_t* null_bitmap = array->buffers[0]->data();
+    offset_type total_length = 0;
+    for (size_t i = 0; i < values.size(); i++) {
+      if (bit_util::GetBit(null_bitmap, i)) {
+        const util::optional<StringType>& value = values[i];
+        DCHECK(value.has_value());
+        if (value->size() >
+                static_cast<size_t>(std::numeric_limits<offset_type>::max()) ||
+            arrow::internal::AddWithOverflow(
+                total_length, static_cast<offset_type>(value->size()), &total_length)) {
+          return Status::Invalid("Result is too large to fit in ", *array->type,
+                                 " cast to large_ variant of type");
+        }
+      }
+      offsets[i] = total_length;
+    }
+    ARROW_ASSIGN_OR_RAISE(auto data, AllocateBuffer(total_length, ctx_->memory_pool()));
+    int64_t offset = 0;
+    for (size_t i = 0; i < values.size(); i++) {
+      if (bit_util::GetBit(null_bitmap, i)) {
+        const util::optional<StringType>& value = values[i];
+        DCHECK(value.has_value());
+        std::memcpy(data->mutable_data() + offset, value->data(), value->size());
+        offset += value->size();
+      }
+    }
+    array->buffers[1] = std::move(raw_offsets);
+    array->buffers.push_back(std::move(data));
+    return Status::OK();
+  }
+
+  template <typename T = Type>
+  enable_if_same<T, FixedSizeBinaryType, Status> MakeOffsetsValues(
+      ArrayData* array, const std::vector<util::optional<StringType>>& values) {
+    const uint8_t* null_bitmap = array->buffers[0]->data();
+    const int32_t slot_width =
+        checked_cast<const FixedSizeBinaryType&>(*array->type).byte_width();
+    int64_t total_length = values.size() * slot_width;
+    ARROW_ASSIGN_OR_RAISE(auto data, AllocateBuffer(total_length, ctx_->memory_pool()));
+    int64_t offset = 0;
+    for (size_t i = 0; i < values.size(); i++) {
+      if (bit_util::GetBit(null_bitmap, i)) {
+        const util::optional<StringType>& value = values[i];
+        DCHECK(value.has_value());
+        std::memcpy(data->mutable_data() + offset, value->data(), slot_width);
+      } else {
+        std::memset(data->mutable_data() + offset, 0x00, slot_width);
+      }
+      offset += slot_width;
+    }
+    array->buffers[1] = std::move(data);
+    return Status::OK();
+  }
+
+  std::shared_ptr<DataType> out_type() const override { return out_type_; }
+
+  ExecContext* ctx_;
+  Allocator allocator_;
+  int64_t num_groups_;
+  std::vector<util::optional<StringType>> ones_;
+  TypedBufferBuilder<bool> has_one_, has_value_;
+  std::shared_ptr<DataType> out_type_;
+};
+
+template <typename T>
+Result<std::unique_ptr<KernelState>> GroupedOneInit(KernelContext* ctx,
+                                                    const KernelInitArgs& args) {
+  ARROW_ASSIGN_OR_RAISE(auto impl, HashAggregateInit<GroupedOneImpl<T>>(ctx, args));
+  auto instance = static_cast<GroupedOneImpl<T>*>(impl.get());
+  instance->out_type_ = args.inputs[0].type;
+  return std::move(impl);
+}
+
+struct GroupedOneFactory {
+  template <typename T>
+  enable_if_physical_integer<T, Status> Visit(const T&) {
+    using PhysicalType = typename T::PhysicalType;
+    kernel = MakeKernel(std::move(argument_type), GroupedOneInit<PhysicalType>);
+    return Status::OK();
+  }
+
+  // MSVC2015 apparently doesn't compile this properly if we use

Review comment:
       We got rid of MSVC 2015, so we can replace these two overloads with `enable_if_floating_point`.

##########
File path: cpp/src/arrow/compute/kernels/hash_aggregate_test.cc
##########
@@ -2460,6 +2461,558 @@ TEST(GroupBy, Distinct) {
   }
 }
 
+MATCHER_P(AnyOfScalar, arrow_array, "") {
+  for (int64_t i = 0; i < arrow_array->length(); ++i) {
+    auto scalar = arrow_array->GetScalar(i).ValueOrDie();

Review comment:
       We could handle the error instead and report an assertion failure if GetScalar fails.
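   
   For instance (a sketch, keeping the `MATCHER_P` structure from the diff above), the matcher body could surface the failure through the result listener instead of crashing on `ValueOrDie()`:
   ```cpp
   MATCHER_P(AnyOfScalar, arrow_array, "") {
     for (int64_t i = 0; i < arrow_array->length(); ++i) {
       auto maybe_scalar = arrow_array->GetScalar(i);
       if (!maybe_scalar.ok()) {
         // Report the error as a match failure rather than aborting the test.
         *result_listener << "GetScalar(" << i
                          << ") failed: " << maybe_scalar.status().ToString();
         return false;
       }
       if ((*maybe_scalar)->Equals(arg)) return true;
     }
     *result_listener << "Argument scalar: '" << arg->ToString()
                      << "' matches no input scalar.";
     return false;
   }
   ```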

##########
File path: cpp/src/arrow/compute/kernels/hash_aggregate.cc
##########
@@ -2451,6 +2451,333 @@ Result<std::unique_ptr<KernelState>> GroupedDistinctInit(KernelContext* ctx,
   return std::move(impl);
 }
 
+// ----------------------------------------------------------------------
+// One implementation
+
+template <typename Type, typename Enable = void>
+struct GroupedOneImpl final : public GroupedAggregator {
+  using CType = typename TypeTraits<Type>::CType;
+  using GetSet = GroupedValueTraits<Type>;
+
+  Status Init(ExecContext* ctx, const std::vector<ValueDescr>&,
+              const FunctionOptions* options) override {
+    // out_type_ initialized by GroupedOneInit
+    ones_ = TypedBufferBuilder<CType>(ctx->memory_pool());
+    has_one_ = TypedBufferBuilder<bool>(ctx->memory_pool());
+    has_value_ = TypedBufferBuilder<bool>(ctx->memory_pool());
+    return Status::OK();
+  }
+
+  Status Resize(int64_t new_num_groups) override {
+    auto added_groups = new_num_groups - num_groups_;
+    num_groups_ = new_num_groups;
+    RETURN_NOT_OK(ones_.Append(added_groups, static_cast<CType>(0)));
+    RETURN_NOT_OK(has_one_.Append(added_groups, false));
+    RETURN_NOT_OK(has_value_.Append(added_groups, false));
+    return Status::OK();
+  }
+
+  Status Consume(const ExecBatch& batch) override {
+    auto raw_ones_ = ones_.mutable_data();
+
+    return VisitGroupedValues<Type>(
+        batch,
+        [&](uint32_t g, CType val) -> Status {
+          if (!bit_util::GetBit(has_one_.data(), g)) {
+            GetSet::Set(raw_ones_, g, val);
+            bit_util::SetBit(has_one_.mutable_data(), g);
+            bit_util::SetBit(has_value_.mutable_data(), g);
+          }
+          return Status::OK();
+        },
+        [&](uint32_t g) -> Status {
+          bit_util::SetBit(has_one_.mutable_data(), g);
+          return Status::OK();
+        });
+  }
+
+  Status Merge(GroupedAggregator&& raw_other,
+               const ArrayData& group_id_mapping) override {
+    auto other = checked_cast<GroupedOneImpl*>(&raw_other);
+
+    auto raw_ones = ones_.mutable_data();
+    auto other_raw_ones = other->ones_.mutable_data();
+
+    auto g = group_id_mapping.GetValues<uint32_t>(1);
+    for (uint32_t other_g = 0; static_cast<int64_t>(other_g) < group_id_mapping.length;
+         ++other_g, ++g) {
+      if (!bit_util::GetBit(has_one_.data(), *g)) {
+        if (bit_util::GetBit(other->has_value_.data(), other_g)) {
+          GetSet::Set(raw_ones, *g, GetSet::Get(other_raw_ones, other_g));
+          bit_util::SetBit(has_value_.mutable_data(), *g);
+        }
+        bit_util::SetBit(has_one_.mutable_data(), *g);
+      }
+    }
+
+    return Status::OK();
+  }
+
+  Result<Datum> Finalize() override {
+    ARROW_ASSIGN_OR_RAISE(auto null_bitmap, has_value_.Finish());
+    ARROW_ASSIGN_OR_RAISE(auto data, ones_.Finish());
+    return ArrayData::Make(out_type_, num_groups_,
+                           {std::move(null_bitmap), std::move(data)});
+  }
+
+  std::shared_ptr<DataType> out_type() const override { return out_type_; }
+
+  int64_t num_groups_;
+  TypedBufferBuilder<CType> ones_;
+  TypedBufferBuilder<bool> has_one_, has_value_;
+  std::shared_ptr<DataType> out_type_;
+};
+
+struct GroupedNullOneImpl : public GroupedAggregator {
+  Status Init(ExecContext* ctx, const std::vector<ValueDescr>&,
+              const FunctionOptions* options) override {
+    return Status::OK();
+  }
+
+  Status Resize(int64_t new_num_groups) override {
+    num_groups_ = new_num_groups;
+    return Status::OK();
+  }
+
+  Status Consume(const ExecBatch& batch) override { return Status::OK(); }
+
+  Status Merge(GroupedAggregator&& raw_other,
+               const ArrayData& group_id_mapping) override {
+    return Status::OK();
+  }
+
+  Result<Datum> Finalize() override {
+    return ArrayData::Make(null(), num_groups_, {nullptr}, num_groups_);
+  }
+
+  std::shared_ptr<DataType> out_type() const override { return null(); }
+
+  int64_t num_groups_;
+};
+
+template <typename Type>
+struct GroupedOneImpl<Type, enable_if_t<is_base_binary_type<Type>::value ||
+                                        std::is_same<Type, FixedSizeBinaryType>::value>>
+    final : public GroupedAggregator {
+  using Allocator = arrow::stl::allocator<char>;
+  using StringType = std::basic_string<char, std::char_traits<char>, Allocator>;
+
+  Status Init(ExecContext* ctx, const std::vector<ValueDescr>&,
+              const FunctionOptions* options) override {
+    ctx_ = ctx;
+    allocator_ = Allocator(ctx->memory_pool());
+    // out_type_ initialized by GroupedOneInit
+    has_value_ = TypedBufferBuilder<bool>(ctx->memory_pool());
+    return Status::OK();
+  }
+
+  Status Resize(int64_t new_num_groups) override {
+    auto added_groups = new_num_groups - num_groups_;
+    DCHECK_GE(added_groups, 0);
+    num_groups_ = new_num_groups;
+    ones_.resize(new_num_groups);
+    RETURN_NOT_OK(has_one_.Append(added_groups, false));
+    RETURN_NOT_OK(has_value_.Append(added_groups, false));
+    return Status::OK();
+  }
+
+  Status Consume(const ExecBatch& batch) override {
+    return VisitGroupedValues<Type>(
+        batch,
+        [&](uint32_t g, util::string_view val) -> Status {
+          if (!bit_util::GetBit(has_one_.data(), g)) {
+            ones_[g].emplace(val.data(), val.size(), allocator_);
+            bit_util::SetBit(has_one_.mutable_data(), g);
+            bit_util::SetBit(has_value_.mutable_data(), g);
+          }
+          return Status::OK();
+        },
+        [&](uint32_t g) -> Status {
+          // as has_one_ is set, has_value_ will never be set, resulting in null
+          bit_util::SetBit(has_one_.mutable_data(), g);
+          return Status::OK();
+        });
+  }
+
+  Status Merge(GroupedAggregator&& raw_other,
+               const ArrayData& group_id_mapping) override {
+    auto other = checked_cast<GroupedOneImpl*>(&raw_other);
+    auto g = group_id_mapping.GetValues<uint32_t>(1);
+    for (uint32_t other_g = 0; static_cast<int64_t>(other_g) < group_id_mapping.length;
+         ++other_g, ++g) {
+      if (!bit_util::GetBit(has_one_.data(), *g)) {
+        if (bit_util::GetBit(other->has_value_.data(), other_g)) {
+          ones_[*g] = std::move(other->ones_[other_g]);
+          bit_util::SetBit(has_value_.mutable_data(), *g);
+        }
+        bit_util::SetBit(has_one_.mutable_data(), *g);
+      }
+    }
+    return Status::OK();
+  }
+
+  Result<Datum> Finalize() override {
+    ARROW_ASSIGN_OR_RAISE(auto null_bitmap, has_value_.Finish());
+    auto ones =
+        ArrayData::Make(out_type(), num_groups_, {std::move(null_bitmap), nullptr});
+    RETURN_NOT_OK(MakeOffsetsValues(ones.get(), ones_));
+    return ones;
+  }
+
+  template <typename T = Type>
+  enable_if_base_binary<T, Status> MakeOffsetsValues(

Review comment:
       We could factor those out, yeah

##########
File path: cpp/src/arrow/compute/kernels/hash_aggregate_test.cc
##########
@@ -2460,6 +2461,558 @@ TEST(GroupBy, Distinct) {
   }
 }
 
+MATCHER_P(AnyOfScalar, arrow_array, "") {
+  for (int64_t i = 0; i < arrow_array->length(); ++i) {
+    auto scalar = arrow_array->GetScalar(i).ValueOrDie();
+    if (scalar->Equals(arg)) return true;
+  }
+  *result_listener << "Argument scalar: '" << arg->ToString()
+                   << "' matches no input scalar.";
+  return false;
+}
+
+MATCHER_P(AnyOfScalarFromUniques, unique_list, "") {
+  const auto& flatten = unique_list->Flatten().ValueOrDie();
+  const auto& offsets = std::dynamic_pointer_cast<Int32Array>(unique_list->offsets());
+
+  for (int64_t i = 0; i < arg->length(); ++i) {
+    bool match_found = false;
+    const auto group_hash_one = arg->GetScalar(i).ValueOrDie();
+    int64_t start = offsets->Value(i);
+    int64_t end = offsets->Value(i + 1);
+    for (int64_t j = start; j < end; ++j) {
+      auto s = flatten->GetScalar(j).ValueOrDie();
+      if (s->Equals(group_hash_one)) {
+        match_found = true;
+        break;
+      }
+    }
+    if (!match_found) {
+      *result_listener << "Argument scalar: '" << group_hash_one->ToString()
+                       << "' matches no input scalar.";
+      return false;
+    }
+  }
+  return true;
+}
+
+TEST(GroupBy, One) {
+  {
+    auto table =
+        TableFromJSON(schema({field("argument", int64()), field("key", int64())}), {R"([
+    [99,  1],
+    [99,  1]
+])",
+                                                                                    R"([
+    [77,  2],
+    [null,   3],
+    [null,   3]
+])",
+                                                                                    R"([
+    [null,   4],
+    [null,   4]
+])",
+                                                                                  R"([
+    [88,  null],
+    [99,  3]
+])",
+                                                                                  R"([
+    [77,  2],
+    [76, 2]
+])",
+                                                                                  R"([
+    [75, null],
+    [74,  3]
+  ])",
+                                                                                  R"([
+    [73,    null],
+    [72,    null]
+  ])"});
+
+  ASSERT_OK_AND_ASSIGN(auto aggregated_and_grouped,
+                       internal::GroupBy(
+                           {
+                               table->GetColumnByName("argument"),
+                           },
+                           {
+                               table->GetColumnByName("key"),
+                           },
+                           {
+                               {"hash_one", nullptr},
+                           },
+                           false));
+  ValidateOutput(aggregated_and_grouped);
+  SortBy({"key_0"}, &aggregated_and_grouped);
+
+  AssertDatumsEqual(ArrayFromJSON(struct_({
+                                      field("hash_one", int64()),
+                                      field("key_0", int64()),
+                                  }),
+                                  R"([
+      [99, 1],
+      [77, 2],
+      [null,  3],
+      [null,  4],
+      [88, null]
+    ])"),
+                    aggregated_and_grouped,
+                    /*verbose=*/true);
+  }
+  {
+    auto table =
+        TableFromJSON(schema({field("argument", utf8()), field("key", int64())}), {R"([
+     ["foo",  1],
+     ["foo",  1]
+ ])",
+                                                                                   R"([
+     ["bar",  2],
+     [null,   3],
+     [null,   3]
+ ])",
+                                                                                   R"([
+     [null,   4],
+     [null,   4]
+ ])",
+                                                                                   R"([
+     ["baz",  null],
+     ["foo",  3]
+ ])",
+                                                                                   R"([
+     ["bar",  2],
+     ["spam", 2]
+ ])",
+                                                                                   R"([
+     ["eggs", null],
+     ["ham",  3]
+   ])",
+                                                                                   R"([
+     ["a",    null],
+     ["b",    null]
+   ])"});
+
+    ASSERT_OK_AND_ASSIGN(auto aggregated_and_grouped,
+                         internal::GroupBy(
+                             {
+                                 table->GetColumnByName("argument"),
+                             },
+                             {
+                                 table->GetColumnByName("key"),
+                             },
+                             {
+                                 {"hash_one", nullptr},
+                             },
+                             false));
+    ValidateOutput(aggregated_and_grouped);
+    SortBy({"key_0"}, &aggregated_and_grouped);
+
+    AssertDatumsEqual(ArrayFromJSON(struct_({
+                                        field("hash_one", utf8()),
+                                        field("key_0", int64()),
+                                    }),
+                                    R"([
+       ["foo", 1],
+       ["bar", 2],
+       [null,  3],
+       [null,  4],
+       ["baz", null]
+     ])"),
+                      aggregated_and_grouped,
+                      /*verbose=*/true);
+  }
+}
+
+TEST(GroupBy, OneOnly) {
+  auto in_schema = schema({
+      field("argument0", float64()),
+      field("argument1", null()),
+      field("argument2", boolean()),
+      field("key", int64()),
+  });
+  for (bool use_exec_plan : {false, true}) {
+    for (bool use_threads : {false, true}) {
+      SCOPED_TRACE(use_threads ? "parallel/merged" : "serial");
+
+      auto table = TableFromJSON(in_schema, {R"([
+    [1.0,   null, true, 1],
+    [null,  null, true, 1]
+])",
+                                             R"([
+    [0.0,   null, false, 2],
+    [null,  null, false, 3],
+    [4.0,   null, null,  null],
+    [3.25,  null, true,  1],
+    [0.125, null, false, 2]
+])",
+                                             R"([
+    [-0.25, null, false, 2],
+    [0.75,  null, true,  null],
+    [null,  null, true,  3]
+])"});
+
+      ASSERT_OK_AND_ASSIGN(Datum aggregated_and_grouped,
+                           GroupByTest(
+                               {
+                                   table->GetColumnByName("argument0"),
+                                   table->GetColumnByName("argument1"),
+                                   table->GetColumnByName("argument2"),
+                               },
+                               {table->GetColumnByName("key")},
+                               {
+                                   {"hash_one", nullptr},
+                                   {"hash_one", nullptr},
+                                   {"hash_one", nullptr},
+                               },
+                               use_threads, use_exec_plan));
+      ValidateOutput(aggregated_and_grouped);
+      SortBy({"key_0"}, &aggregated_and_grouped);
+
+      //      AssertDatumsEqual(ArrayFromJSON(struct_({
+      //                                          field("hash_one", float64()),
+      //                                          field("hash_one", null()),
+      //                                          field("hash_one", boolean()),
+      //                                          field("key_0", int64()),
+      //                                      }),
+      //                                      R"([
+      //          [1.0,  null, true,  1],
+      //          [0.0,  null, false, 2],
+      //          [null, null, false, 3],
+      //          [4.0,  null, null,  null]
+      //        ])"),
+      //                        aggregated_and_grouped,
+      //                        /*verbose=*/true);
+
+      const auto& struct_arr = aggregated_and_grouped.array_as<StructArray>();
+      //  Check the key column
+      AssertDatumsEqual(ArrayFromJSON(int64(), "[1, 2, 3, null]"), struct_arr->field(3));
+
+      auto type_col_0 = float64();
+      auto group_one_col_0 =
+          AnyOfScalar(ArrayFromJSON(type_col_0, R"([1.0, null, 3.25])"));
+      auto group_two_col_0 =
+          AnyOfScalar(ArrayFromJSON(type_col_0, R"([0.0, 0.125, -0.25])"));
+      auto group_three_col_0 = AnyOfScalar(ArrayFromJSON(type_col_0, R"([null])"));
+      auto group_null_col_0 = AnyOfScalar(ArrayFromJSON(type_col_0, R"([4.0, 0.75])"));
+
+      //  Check values individually
+      const auto& col0 = struct_arr->field(0);
+      ASSERT_OK_AND_ASSIGN(const auto g_one, col0->GetScalar(0));
+      EXPECT_THAT(g_one, group_one_col_0);
+      ASSERT_OK_AND_ASSIGN(const auto g_two, col0->GetScalar(1));
+      EXPECT_THAT(g_two, group_two_col_0);
+      ASSERT_OK_AND_ASSIGN(const auto g_three, col0->GetScalar(2));
+      EXPECT_THAT(g_three, group_three_col_0);
+      ASSERT_OK_AND_ASSIGN(const auto g_null, col0->GetScalar(3));
+      EXPECT_THAT(g_null, group_null_col_0);

Review comment:
       ResultWith is in matchers.h: https://github.com/apache/arrow/blob/26d6e6217ff79451a3fe366bcc88293c7ae67417/cpp/src/arrow/testing/matchers.h#L250-L254
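   
   With that (a sketch, reusing the `AnyOfScalar` matchers built above), the GetScalar/EXPECT_THAT pairs collapse to:
   ```cpp
   // ResultWith applies the inner matcher to the value inside the Result,
   // so the intermediate ASSERT_OK_AND_ASSIGN is no longer needed.
   EXPECT_THAT(col0->GetScalar(0), ResultWith(group_one_col_0));
   EXPECT_THAT(col0->GetScalar(1), ResultWith(group_two_col_0));
   ```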

##########
File path: cpp/src/arrow/compute/kernels/hash_aggregate.cc
##########
@@ -2451,6 +2451,92 @@ Result<std::unique_ptr<KernelState>> GroupedDistinctInit(KernelContext* ctx,
   return std::move(impl);
 }
 
+// ----------------------------------------------------------------------
+// One implementation
+
+struct GroupedOneImpl : public GroupedAggregator {
+  Status Init(ExecContext* ctx, const std::vector<ValueDescr>&,
+              const FunctionOptions* options) override {
+    ctx_ = ctx;
+    pool_ = ctx->memory_pool();
+    return Status::OK();
+  }
+
+  Status Resize(int64_t new_num_groups) override {
+    num_groups_ = new_num_groups;
+    return Status::OK();
+  }
+
+  Status Consume(const ExecBatch& batch) override {
+    ARROW_ASSIGN_OR_RAISE(std::ignore, grouper_->Consume(batch));
+    return Status::OK();
+  }
+
+  Status Merge(GroupedAggregator&& raw_other,
+               const ArrayData& group_id_mapping) override {
+    auto other = checked_cast<GroupedOneImpl*>(&raw_other);
+
+    // Get (value, group_id) pairs, then translate the group IDs and consume them
+    // ourselves
+    ARROW_ASSIGN_OR_RAISE(auto uniques, other->grouper_->GetUniques());
+    ARROW_ASSIGN_OR_RAISE(auto remapped_g,
+                          AllocateBuffer(uniques.length * sizeof(uint32_t), pool_));
+
+    const auto* g_mapping = group_id_mapping.GetValues<uint32_t>(1);
+    const auto* other_g = uniques[1].array()->GetValues<uint32_t>(1);
+    auto* g = reinterpret_cast<uint32_t*>(remapped_g->mutable_data());
+
+    for (int64_t i = 0; i < uniques.length; i++) {
+      g[i] = g_mapping[other_g[i]];
+    }
+    uniques.values[1] =
+        ArrayData::Make(uint32(), uniques.length, {nullptr, std::move(remapped_g)});
+
+    return Consume(std::move(uniques));
+  }
+
+  Result<Datum> Finalize() override {

Review comment:
       Hash aggregates can be executed in parallel
   
   Consume takes an input batch and updates local state.
   Merge takes two local states and combines them.
   Finalize takes a local state and produces the output array.
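   
   In other words, the flow is roughly this (a conceptual sketch only — `thread_local_batches`, `grouper`, `local_state`, and `global_state` are hypothetical names, not the exec plan's actual scheduling code):
   ```cpp
   // Each thread consumes its own batches into a private KernelState.
   for (const ExecBatch& batch : thread_local_batches) {
     RETURN_NOT_OK(local_state->Resize(grouper->num_groups()));
     RETURN_NOT_OK(local_state->Consume(batch));
   }
   // Local states are then pairwise merged, translating each local group id
   // into the global group id space via group_id_mapping...
   RETURN_NOT_OK(global_state->Merge(std::move(*local_state), group_id_mapping));
   // ...and the surviving state is finalized exactly once.
   ARROW_ASSIGN_OR_RAISE(Datum result, global_state->Finalize());
   ```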

##########
File path: cpp/src/arrow/compute/kernels/hash_aggregate_test.cc
##########
@@ -2460,6 +2460,476 @@ TEST(GroupBy, Distinct) {
   }
 }
 
+TEST(GroupBy, One) {

Review comment:
       I think we can remove this, and we can consolidate test cases to be more compact.
   
   We can have one test for all the numeric types ("OneTypes", though maybe let's rename it "OneNumericTypes" or something?), then one test for all the "misc" types (write out one large input for null, boolean, decimal128, decimal256, fixed size binary), and one test for all the binary types (iterate through binary/large binary/string/large string).
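   
   For the binary-type iteration, something like this sketch (using the `TableFromJSON` and `GroupByTest` helpers already in this file; input data and expectations elided) would keep it compact:
   ```cpp
   // Sketch: run the same hash_one expectations once per binary-like type.
   for (const auto& ty : {binary(), large_binary(), utf8(), large_utf8()}) {
     SCOPED_TRACE(ty->ToString());
     auto table = TableFromJSON(
         schema({field("argument", ty), field("key", int64())}), {R"([
       ["foo", 1],
       [null,  1],
       ["bar", 2]
   ])"});
     ASSERT_OK_AND_ASSIGN(
         Datum aggregated_and_grouped,
         GroupByTest({table->GetColumnByName("argument")},
                     {table->GetColumnByName("key")}, {{"hash_one", nullptr}},
                     /*use_threads=*/false, /*use_exec_plan=*/true));
     ValidateOutput(aggregated_and_grouped);
   }
   ```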

##########
File path: cpp/src/arrow/compute/kernels/hash_aggregate.cc
##########
@@ -2451,6 +2451,333 @@ Result<std::unique_ptr<KernelState>> GroupedDistinctInit(KernelContext* ctx,
   return std::move(impl);
 }
 
+// ----------------------------------------------------------------------
+// One implementation
+
+template <typename Type, typename Enable = void>
+struct GroupedOneImpl final : public GroupedAggregator {
+  using CType = typename TypeTraits<Type>::CType;
+  using GetSet = GroupedValueTraits<Type>;
+
+  Status Init(ExecContext* ctx, const std::vector<ValueDescr>&,
+              const FunctionOptions* options) override {
+    // out_type_ initialized by GroupedOneInit
+    ones_ = TypedBufferBuilder<CType>(ctx->memory_pool());
+    has_one_ = TypedBufferBuilder<bool>(ctx->memory_pool());
+    has_value_ = TypedBufferBuilder<bool>(ctx->memory_pool());
+    return Status::OK();
+  }
+
+  Status Resize(int64_t new_num_groups) override {
+    auto added_groups = new_num_groups - num_groups_;
+    num_groups_ = new_num_groups;
+    RETURN_NOT_OK(ones_.Append(added_groups, static_cast<CType>(0)));
+    RETURN_NOT_OK(has_one_.Append(added_groups, false));
+    RETURN_NOT_OK(has_value_.Append(added_groups, false));
+    return Status::OK();
+  }
+
+  Status Consume(const ExecBatch& batch) override {
+    auto raw_ones_ = ones_.mutable_data();
+
+    return VisitGroupedValues<Type>(
+        batch,
+        [&](uint32_t g, CType val) -> Status {
+          if (!bit_util::GetBit(has_one_.data(), g)) {
+            GetSet::Set(raw_ones_, g, val);
+            bit_util::SetBit(has_one_.mutable_data(), g);
+            bit_util::SetBit(has_value_.mutable_data(), g);
+          }
+          return Status::OK();
+        },
+        [&](uint32_t g) -> Status {
+          bit_util::SetBit(has_one_.mutable_data(), g);

Review comment:
       Hmm, maybe we don't want this? That is, we could remove this and "bias" the kernel towards not returning null.

##########
File path: cpp/src/arrow/compute/kernels/hash_aggregate_test.cc
##########
@@ -2460,6 +2461,558 @@ TEST(GroupBy, Distinct) {
   }
 }
 
+MATCHER_P(AnyOfScalar, arrow_array, "") {
+  for (int64_t i = 0; i < arrow_array->length(); ++i) {
+    auto scalar = arrow_array->GetScalar(i).ValueOrDie();
+    if (scalar->Equals(arg)) return true;
+  }
+  *result_listener << "Argument scalar: '" << arg->ToString()
+                   << "' matches no input scalar.";
+  return false;
+}
+
+MATCHER_P(AnyOfScalarFromUniques, unique_list, "") {
+  const auto& flatten = unique_list->Flatten().ValueOrDie();
+  const auto& offsets = std::dynamic_pointer_cast<Int32Array>(unique_list->offsets());
+
+  for (int64_t i = 0; i < arg->length(); ++i) {
+    bool match_found = false;
+    const auto group_hash_one = arg->GetScalar(i).ValueOrDie();
+    int64_t start = offsets->Value(i);
+    int64_t end = offsets->Value(i + 1);
+    for (int64_t j = start; j < end; ++j) {
+      auto s = flatten->GetScalar(j).ValueOrDie();
+      if (s->Equals(group_hash_one)) {
+        match_found = true;
+        break;
+      }
+    }
+    if (!match_found) {
+      *result_listener << "Argument scalar: '" << group_hash_one->ToString()
+                       << "' matches no input scalar.";
+      return false;
+    }
+  }
+  return true;
+}
+
+TEST(GroupBy, One) {
+  {
+    auto table =
+        TableFromJSON(schema({field("argument", int64()), field("key", int64())}), {R"([
+    [99,  1],
+    [99,  1]
+])",
+                                                                                    R"([
+    [77,  2],
+    [null,   3],
+    [null,   3]
+])",
+                                                                                    R"([
+    [null,   4],
+    [null,   4]
+])",
+                                                                                  R"([
+    [88,  null],
+    [99,  3]
+])",
+                                                                                  R"([
+    [77,  2],
+    [76, 2]
+])",
+                                                                                  R"([
+    [75, null],
+    [74,  3]
+  ])",
+                                                                                  R"([
+    [73,    null],
+    [72,    null]
+  ])"});
+
+  ASSERT_OK_AND_ASSIGN(auto aggregated_and_grouped,
+                       internal::GroupBy(
+                           {
+                               table->GetColumnByName("argument"),
+                           },
+                           {
+                               table->GetColumnByName("key"),
+                           },
+                           {
+                               {"hash_one", nullptr},
+                           },
+                           false));
+  ValidateOutput(aggregated_and_grouped);
+  SortBy({"key_0"}, &aggregated_and_grouped);
+
+  AssertDatumsEqual(ArrayFromJSON(struct_({
+                                      field("hash_one", int64()),
+                                      field("key_0", int64()),
+                                  }),
+                                  R"([
+      [99, 1],
+      [77, 2],
+      [null,  3],
+      [null,  4],
+      [88, null]
+    ])"),
+                    aggregated_and_grouped,
+                    /*verbose=*/true);
+  }
+  {
+    auto table =
+        TableFromJSON(schema({field("argument", utf8()), field("key", int64())}), {R"([
+     ["foo",  1],
+     ["foo",  1]
+ ])",
+                                                                                   R"([
+     ["bar",  2],
+     [null,   3],
+     [null,   3]
+ ])",
+                                                                                   R"([
+     [null,   4],
+     [null,   4]
+ ])",
+                                                                                   R"([
+     ["baz",  null],
+     ["foo",  3]
+ ])",
+                                                                                   R"([
+     ["bar",  2],
+     ["spam", 2]
+ ])",
+                                                                                   R"([
+     ["eggs", null],
+     ["ham",  3]
+   ])",
+                                                                                   R"([
+     ["a",    null],
+     ["b",    null]
+   ])"});
+
+    ASSERT_OK_AND_ASSIGN(auto aggregated_and_grouped,
+                         internal::GroupBy(
+                             {
+                                 table->GetColumnByName("argument"),
+                             },
+                             {
+                                 table->GetColumnByName("key"),
+                             },
+                             {
+                                 {"hash_one", nullptr},
+                             },
+                             false));
+    ValidateOutput(aggregated_and_grouped);
+    SortBy({"key_0"}, &aggregated_and_grouped);
+
+    AssertDatumsEqual(ArrayFromJSON(struct_({
+                                        field("hash_one", utf8()),
+                                        field("key_0", int64()),
+                                    }),
+                                    R"([
+       ["foo", 1],
+       ["bar", 2],
+       [null,  3],
+       [null,  4],
+       ["baz", null]
+     ])"),
+                      aggregated_and_grouped,
+                      /*verbose=*/true);
+  }
+}
+
+TEST(GroupBy, OneOnly) {
+  auto in_schema = schema({
+      field("argument0", float64()),
+      field("argument1", null()),
+      field("argument2", boolean()),
+      field("key", int64()),
+  });
+  for (bool use_exec_plan : {false, true}) {
+    for (bool use_threads : {false, true}) {
+      SCOPED_TRACE(use_threads ? "parallel/merged" : "serial");
+
+      auto table = TableFromJSON(in_schema, {R"([
+    [1.0,   null, true, 1],
+    [null,  null, true, 1]
+])",
+                                             R"([
+    [0.0,   null, false, 2],
+    [null,  null, false, 3],
+    [4.0,   null, null,  null],
+    [3.25,  null, true,  1],
+    [0.125, null, false, 2]
+])",
+                                             R"([
+    [-0.25, null, false, 2],
+    [0.75,  null, true,  null],
+    [null,  null, true,  3]
+])"});
+
+      ASSERT_OK_AND_ASSIGN(Datum aggregated_and_grouped,
+                           GroupByTest(
+                               {
+                                   table->GetColumnByName("argument0"),
+                                   table->GetColumnByName("argument1"),
+                                   table->GetColumnByName("argument2"),
+                               },
+                               {table->GetColumnByName("key")},
+                               {
+                                   {"hash_one", nullptr},
+                                   {"hash_one", nullptr},
+                                   {"hash_one", nullptr},
+                               },
+                               use_threads, use_exec_plan));
+      ValidateOutput(aggregated_and_grouped);
+      SortBy({"key_0"}, &aggregated_and_grouped);
+
+      //      AssertDatumsEqual(ArrayFromJSON(struct_({
+      //                                          field("hash_one", float64()),
+      //                                          field("hash_one", null()),
+      //                                          field("hash_one", boolean()),
+      //                                          field("key_0", int64()),
+      //                                      }),
+      //                                      R"([
+      //          [1.0,  null, true,  1],
+      //          [0.0,  null, false, 2],
+      //          [null, null, false, 3],
+      //          [4.0,  null, null,  null]
+      //        ])"),
+      //                        aggregated_and_grouped,
+      //                        /*verbose=*/true);
+
+      const auto& struct_arr = aggregated_and_grouped.array_as<StructArray>();
+      //  Check the key column
+      AssertDatumsEqual(ArrayFromJSON(int64(), "[1, 2, 3, null]"), struct_arr->field(3));
+
+      auto type_col_0 = float64();
+      auto group_one_col_0 =
+          AnyOfScalar(ArrayFromJSON(type_col_0, R"([1.0, null, 3.25])"));
+      auto group_two_col_0 =
+          AnyOfScalar(ArrayFromJSON(type_col_0, R"([0.0, 0.125, -0.25])"));
+      auto group_three_col_0 = AnyOfScalar(ArrayFromJSON(type_col_0, R"([null])"));
+      auto group_null_col_0 = AnyOfScalar(ArrayFromJSON(type_col_0, R"([4.0, 0.75])"));
+
+      //  Check values individually
+      const auto& col0 = struct_arr->field(0);
+      ASSERT_OK_AND_ASSIGN(const auto g_one, col0->GetScalar(0));
+      EXPECT_THAT(g_one, group_one_col_0);
+      ASSERT_OK_AND_ASSIGN(const auto g_two, col0->GetScalar(1));
+      EXPECT_THAT(g_two, group_two_col_0);
+      ASSERT_OK_AND_ASSIGN(const auto g_three, col0->GetScalar(2));
+      EXPECT_THAT(g_three, group_three_col_0);
+      ASSERT_OK_AND_ASSIGN(const auto g_null, col0->GetScalar(3));
+      EXPECT_THAT(g_null, group_null_col_0);

Review comment:
       I think something like `EXPECT_THAT(col0->GetScalar(0), ResultWith(AnyOfScalar(...)))` could shorten this. Also, we could make a helper function `AnyOfJSON(type, str)` which calls `AnyOfScalar(ArrayFromJSON(...))` for you.
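
       A minimal sketch of that helper (name and placement are just suggestions; assumes C++14 `auto` return deduction is available):

       ```cpp
       // Sketch: thin wrapper so call sites don't repeat ArrayFromJSON(...) everywhere.
       inline auto AnyOfJSON(const std::shared_ptr<DataType>& type, const std::string& json) {
         return AnyOfScalar(ArrayFromJSON(type, json));
       }

       // Usage:
       EXPECT_THAT(col0->GetScalar(0), ResultWith(AnyOfJSON(float64(), R"([1.0, null, 3.25])")));
       ```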

##########
File path: cpp/src/arrow/compute/kernels/hash_aggregate_test.cc
##########
@@ -2460,6 +2461,558 @@ TEST(GroupBy, Distinct) {
   }
 }
 
+MATCHER_P(AnyOfScalar, arrow_array, "") {
+  for (int64_t i = 0; i < arrow_array->length(); ++i) {
+    auto scalar = arrow_array->GetScalar(i).ValueOrDie();
+    if (scalar->Equals(arg)) return true;
+  }
+  *result_listener << "Argument scalar: '" << arg->ToString()
+                   << "' matches no input scalar.";
+  return false;
+}
+
+MATCHER_P(AnyOfScalarFromUniques, unique_list, "") {
+  const auto& flatten = unique_list->Flatten().ValueOrDie();
+  const auto& offsets = std::dynamic_pointer_cast<Int32Array>(unique_list->offsets());
+
+  for (int64_t i = 0; i < arg->length(); ++i) {
+    bool match_found = false;
+    const auto group_hash_one = arg->GetScalar(i).ValueOrDie();
+    int64_t start = offsets->Value(i);
+    int64_t end = offsets->Value(i + 1);
+    for (int64_t j = start; j < end; ++j) {
+      auto s = flatten->GetScalar(j).ValueOrDie();
+      if (s->Equals(group_hash_one)) {
+        match_found = true;
+        break;
+      }
+    }
+    if (!match_found) {
+      *result_listener << "Argument scalar: '" << group_hash_one->ToString()
+                       << "' matches no input scalar.";
+      return false;
+    }
+  }
+  return true;
+}
+
+TEST(GroupBy, One) {
+  {
+    auto table =
+        TableFromJSON(schema({field("argument", int64()), field("key", int64())}), {R"([
+    [99,  1],
+    [99,  1]
+])",
+                                                                                    R"([
+    [77,  2],
+    [null,   3],
+    [null,   3]
+])",
+                                                                                    R"([
+    [null,   4],
+    [null,   4]
+])",
+                                                                                  R"([
+    [88,  null],
+    [99,  3]
+])",
+                                                                                  R"([
+    [77,  2],
+    [76, 2]
+])",
+                                                                                  R"([
+    [75, null],
+    [74,  3]
+  ])",
+                                                                                  R"([
+    [73,    null],
+    [72,    null]
+  ])"});
+
+  ASSERT_OK_AND_ASSIGN(auto aggregated_and_grouped,
+                       internal::GroupBy(
+                           {
+                               table->GetColumnByName("argument"),
+                           },
+                           {
+                               table->GetColumnByName("key"),
+                           },
+                           {
+                               {"hash_one", nullptr},
+                           },
+                           false));
+  ValidateOutput(aggregated_and_grouped);
+  SortBy({"key_0"}, &aggregated_and_grouped);
+
+  AssertDatumsEqual(ArrayFromJSON(struct_({
+                                      field("hash_one", int64()),
+                                      field("key_0", int64()),
+                                  }),
+                                  R"([
+      [99, 1],
+      [77, 2],
+      [null,  3],
+      [null,  4],
+      [88, null]
+    ])"),
+                    aggregated_and_grouped,
+                    /*verbose=*/true);
+  }
+  {
+    auto table =
+        TableFromJSON(schema({field("argument", utf8()), field("key", int64())}), {R"([
+     ["foo",  1],
+     ["foo",  1]
+ ])",
+                                                                                   R"([
+     ["bar",  2],
+     [null,   3],
+     [null,   3]
+ ])",
+                                                                                   R"([
+     [null,   4],
+     [null,   4]
+ ])",
+                                                                                   R"([
+     ["baz",  null],
+     ["foo",  3]
+ ])",
+                                                                                   R"([
+     ["bar",  2],
+     ["spam", 2]
+ ])",
+                                                                                   R"([
+     ["eggs", null],
+     ["ham",  3]
+   ])",
+                                                                                   R"([
+     ["a",    null],
+     ["b",    null]
+   ])"});
+
+    ASSERT_OK_AND_ASSIGN(auto aggregated_and_grouped,
+                         internal::GroupBy(
+                             {
+                                 table->GetColumnByName("argument"),
+                             },
+                             {
+                                 table->GetColumnByName("key"),
+                             },
+                             {
+                                 {"hash_one", nullptr},
+                             },
+                             false));
+    ValidateOutput(aggregated_and_grouped);
+    SortBy({"key_0"}, &aggregated_and_grouped);
+
+    AssertDatumsEqual(ArrayFromJSON(struct_({
+                                        field("hash_one", utf8()),
+                                        field("key_0", int64()),
+                                    }),
+                                    R"([
+       ["foo", 1],
+       ["bar", 2],
+       [null,  3],
+       [null,  4],
+       ["baz", null]
+     ])"),
+                      aggregated_and_grouped,
+                      /*verbose=*/true);
+  }
+}
+
+TEST(GroupBy, OneOnly) {
+  auto in_schema = schema({
+      field("argument0", float64()),
+      field("argument1", null()),
+      field("argument2", boolean()),
+      field("key", int64()),
+  });
+  for (bool use_exec_plan : {false, true}) {
+    for (bool use_threads : {false, true}) {
+      SCOPED_TRACE(use_threads ? "parallel/merged" : "serial");
+
+      auto table = TableFromJSON(in_schema, {R"([
+    [1.0,   null, true, 1],
+    [null,  null, true, 1]
+])",
+                                             R"([
+    [0.0,   null, false, 2],
+    [null,  null, false, 3],
+    [4.0,   null, null,  null],
+    [3.25,  null, true,  1],
+    [0.125, null, false, 2]
+])",
+                                             R"([
+    [-0.25, null, false, 2],
+    [0.75,  null, true,  null],
+    [null,  null, true,  3]
+])"});
+
+      ASSERT_OK_AND_ASSIGN(Datum aggregated_and_grouped,
+                           GroupByTest(
+                               {
+                                   table->GetColumnByName("argument0"),
+                                   table->GetColumnByName("argument1"),
+                                   table->GetColumnByName("argument2"),
+                               },
+                               {table->GetColumnByName("key")},
+                               {
+                                   {"hash_one", nullptr},
+                                   {"hash_one", nullptr},
+                                   {"hash_one", nullptr},
+                               },
+                               use_threads, use_exec_plan));
+      ValidateOutput(aggregated_and_grouped);
+      SortBy({"key_0"}, &aggregated_and_grouped);
+
+      //      AssertDatumsEqual(ArrayFromJSON(struct_({
+      //                                          field("hash_one", float64()),
+      //                                          field("hash_one", null()),
+      //                                          field("hash_one", boolean()),
+      //                                          field("key_0", int64()),
+      //                                      }),
+      //                                      R"([
+      //          [1.0,  null, true,  1],
+      //          [0.0,  null, false, 2],
+      //          [null, null, false, 3],
+      //          [4.0,  null, null,  null]
+      //        ])"),
+      //                        aggregated_and_grouped,
+      //                        /*verbose=*/true);
+
+      const auto& struct_arr = aggregated_and_grouped.array_as<StructArray>();
+      //  Check the key column
+      AssertDatumsEqual(ArrayFromJSON(int64(), "[1, 2, 3, null]"), struct_arr->field(3));
+
+      auto type_col_0 = float64();
+      auto group_one_col_0 =
+          AnyOfScalar(ArrayFromJSON(type_col_0, R"([1.0, null, 3.25])"));
+      auto group_two_col_0 =
+          AnyOfScalar(ArrayFromJSON(type_col_0, R"([0.0, 0.125, -0.25])"));
+      auto group_three_col_0 = AnyOfScalar(ArrayFromJSON(type_col_0, R"([null])"));
+      auto group_null_col_0 = AnyOfScalar(ArrayFromJSON(type_col_0, R"([4.0, 0.75])"));
+
+      //  Check values individually
+      const auto& col0 = struct_arr->field(0);
+      ASSERT_OK_AND_ASSIGN(const auto g_one, col0->GetScalar(0));
+      EXPECT_THAT(g_one, group_one_col_0);
+      ASSERT_OK_AND_ASSIGN(const auto g_two, col0->GetScalar(1));
+      EXPECT_THAT(g_two, group_two_col_0);
+      ASSERT_OK_AND_ASSIGN(const auto g_three, col0->GetScalar(2));
+      EXPECT_THAT(g_three, group_three_col_0);
+      ASSERT_OK_AND_ASSIGN(const auto g_null, col0->GetScalar(3));
+      EXPECT_THAT(g_null, group_null_col_0);
+
+      CountOptions all(CountOptions::ALL);
+      ASSERT_OK_AND_ASSIGN(
+          auto distinct_out,
+          internal::GroupBy(
+              {
+                  table->GetColumnByName("argument0"),
+                  table->GetColumnByName("argument1"),
+                  table->GetColumnByName("argument2"),
+              },
+              {
+                  table->GetColumnByName("key"),
+              },
+              {{"hash_distinct", &all}, {"hash_distinct", &all}, {"hash_distinct", &all}},
+              use_threads));
+      ValidateOutput(distinct_out);
+      SortBy({"key_0"}, &distinct_out);
+
+      const auto& struct_arr_distinct = distinct_out.array_as<StructArray>();
+      for (int64_t col = 0; col < struct_arr_distinct->length() - 1; ++col) {
+        const auto matcher = AnyOfScalarFromUniques(
+            checked_pointer_cast<ListArray>(struct_arr_distinct->field(col)));
+        EXPECT_THAT(struct_arr->field(col), matcher);
+      }

Review comment:
       We can use other kernels, but I'm not sure this is any cleaner. The other approach is repetitive, but clear about what's going on; this one requires a lot of thought to see what's happening.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] ursabot edited a comment on pull request #12368: ARROW-13993: [C++] [Compute] Add hash_one aggregate function

Posted by GitBox <gi...@apache.org>.
ursabot edited a comment on pull request #12368:
URL: https://github.com/apache/arrow/pull/12368#issuecomment-1042943111


   Benchmark runs are scheduled for baseline = ed25c616c6270142cce0a2a36c7474e28e167184 and contender = 74f512260fa69903feac61e1287f6954a3d98204. 74f512260fa69903feac61e1287f6954a3d98204 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
   Conbench compare runs links:
   [Finished :arrow_down:0.0% :arrow_up:0.0%] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/a94a3f0000b84a85a4d19562ff559187...7b33e4d8929944749d78fd64bf5054fa/)
   [Scheduled] [test-mac-arm](https://conbench.ursa.dev/compare/runs/15b82b3edbb4496baefc9bb7f57453c9...34fcb6b914b4436fae5816b099686be4/)
   [Scheduled] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/11c0b23a67e6433c9358d07ced046614...2f00fea949d543acae398e06ffb8787f/)
   [Scheduled] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/31368249ed3448b5b3ec5a0396533d0b...ce98c7d67ad84386b4028c02679b0d02/)
   Supported benchmarks:
   ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python. Runs only benchmarks with cloud = True
   test-mac-arm: Supported benchmark langs: C++, Python, R
   ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
   ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] lidavidm commented on a change in pull request #12368: ARROW-13993: [C++] [Compute] Add hash_one aggregate function

Posted by GitBox <gi...@apache.org>.
lidavidm commented on a change in pull request #12368:
URL: https://github.com/apache/arrow/pull/12368#discussion_r807957098



##########
File path: cpp/src/arrow/compute/kernels/hash_aggregate.cc
##########
@@ -2451,6 +2451,333 @@ Result<std::unique_ptr<KernelState>> GroupedDistinctInit(KernelContext* ctx,
   return std::move(impl);
 }
 
+// ----------------------------------------------------------------------
+// One implementation
+
+template <typename Type, typename Enable = void>
+struct GroupedOneImpl final : public GroupedAggregator {
+  using CType = typename TypeTraits<Type>::CType;
+  using GetSet = GroupedValueTraits<Type>;
+
+  Status Init(ExecContext* ctx, const std::vector<ValueDescr>&,
+              const FunctionOptions* options) override {
+    // out_type_ initialized by GroupedOneInit
+    ones_ = TypedBufferBuilder<CType>(ctx->memory_pool());
+    has_one_ = TypedBufferBuilder<bool>(ctx->memory_pool());
+    has_value_ = TypedBufferBuilder<bool>(ctx->memory_pool());
+    return Status::OK();
+  }
+
+  Status Resize(int64_t new_num_groups) override {
+    auto added_groups = new_num_groups - num_groups_;
+    num_groups_ = new_num_groups;
+    RETURN_NOT_OK(ones_.Append(added_groups, static_cast<CType>(0)));
+    RETURN_NOT_OK(has_one_.Append(added_groups, false));
+    RETURN_NOT_OK(has_value_.Append(added_groups, false));
+    return Status::OK();
+  }
+
+  Status Consume(const ExecBatch& batch) override {
+    auto raw_ones_ = ones_.mutable_data();
+
+    return VisitGroupedValues<Type>(
+        batch,
+        [&](uint32_t g, CType val) -> Status {
+          if (!bit_util::GetBit(has_one_.data(), g)) {
+            GetSet::Set(raw_ones_, g, val);
+            bit_util::SetBit(has_one_.mutable_data(), g);
+            bit_util::SetBit(has_value_.mutable_data(), g);
+          }
+          return Status::OK();
+        },
+        [&](uint32_t g) -> Status {
+          bit_util::SetBit(has_one_.mutable_data(), g);
+          return Status::OK();
+        });
+  }
+
+  Status Merge(GroupedAggregator&& raw_other,
+               const ArrayData& group_id_mapping) override {
+    auto other = checked_cast<GroupedOneImpl*>(&raw_other);
+
+    auto raw_ones = ones_.mutable_data();
+    auto other_raw_ones = other->ones_.mutable_data();
+
+    auto g = group_id_mapping.GetValues<uint32_t>(1);
+    for (uint32_t other_g = 0; static_cast<int64_t>(other_g) < group_id_mapping.length;
+         ++other_g, ++g) {
+      if (!bit_util::GetBit(has_one_.data(), *g)) {
+        if (bit_util::GetBit(other->has_value_.data(), other_g)) {
+          GetSet::Set(raw_ones, *g, GetSet::Get(other_raw_ones, other_g));
+          bit_util::SetBit(has_value_.mutable_data(), *g);
+        }
+        bit_util::SetBit(has_one_.mutable_data(), *g);
+      }
+    }
+
+    return Status::OK();
+  }
+
+  Result<Datum> Finalize() override {
+    ARROW_ASSIGN_OR_RAISE(auto null_bitmap, has_value_.Finish());
+    ARROW_ASSIGN_OR_RAISE(auto data, ones_.Finish());
+    return ArrayData::Make(out_type_, num_groups_,
+                           {std::move(null_bitmap), std::move(data)});
+  }
+
+  std::shared_ptr<DataType> out_type() const override { return out_type_; }
+
+  int64_t num_groups_;
+  TypedBufferBuilder<CType> ones_;
+  TypedBufferBuilder<bool> has_one_, has_value_;
+  std::shared_ptr<DataType> out_type_;
+};
+
+struct GroupedNullOneImpl : public GroupedAggregator {
+  Status Init(ExecContext* ctx, const std::vector<ValueDescr>&,
+              const FunctionOptions* options) override {
+    return Status::OK();
+  }
+
+  Status Resize(int64_t new_num_groups) override {
+    num_groups_ = new_num_groups;
+    return Status::OK();
+  }
+
+  Status Consume(const ExecBatch& batch) override { return Status::OK(); }
+
+  Status Merge(GroupedAggregator&& raw_other,
+               const ArrayData& group_id_mapping) override {
+    return Status::OK();
+  }
+
+  Result<Datum> Finalize() override {
+    return ArrayData::Make(null(), num_groups_, {nullptr}, num_groups_);
+  }
+
+  std::shared_ptr<DataType> out_type() const override { return null(); }
+
+  int64_t num_groups_;
+};
+
+template <typename Type>
+struct GroupedOneImpl<Type, enable_if_t<is_base_binary_type<Type>::value ||
+                                        std::is_same<Type, FixedSizeBinaryType>::value>>
+    final : public GroupedAggregator {
+  using Allocator = arrow::stl::allocator<char>;
+  using StringType = std::basic_string<char, std::char_traits<char>, Allocator>;
+
+  Status Init(ExecContext* ctx, const std::vector<ValueDescr>&,
+              const FunctionOptions* options) override {
+    ctx_ = ctx;
+    allocator_ = Allocator(ctx->memory_pool());
+    // out_type_ initialized by GroupedOneInit
+    has_value_ = TypedBufferBuilder<bool>(ctx->memory_pool());
+    return Status::OK();
+  }
+
+  Status Resize(int64_t new_num_groups) override {
+    auto added_groups = new_num_groups - num_groups_;
+    DCHECK_GE(added_groups, 0);
+    num_groups_ = new_num_groups;
+    ones_.resize(new_num_groups);
+    RETURN_NOT_OK(has_one_.Append(added_groups, false));
+    RETURN_NOT_OK(has_value_.Append(added_groups, false));
+    return Status::OK();
+  }
+
+  Status Consume(const ExecBatch& batch) override {
+    return VisitGroupedValues<Type>(
+        batch,
+        [&](uint32_t g, util::string_view val) -> Status {
+          if (!bit_util::GetBit(has_one_.data(), g)) {
+            ones_[g].emplace(val.data(), val.size(), allocator_);
+            bit_util::SetBit(has_one_.mutable_data(), g);
+            bit_util::SetBit(has_value_.mutable_data(), g);
+          }
+          return Status::OK();
+        },
+        [&](uint32_t g) -> Status {
+          // as has_one_ is set, has_value_ will never be set, resulting in null
+          bit_util::SetBit(has_one_.mutable_data(), g);
+          return Status::OK();
+        });
+  }
+
+  Status Merge(GroupedAggregator&& raw_other,
+               const ArrayData& group_id_mapping) override {
+    auto other = checked_cast<GroupedOneImpl*>(&raw_other);
+    auto g = group_id_mapping.GetValues<uint32_t>(1);
+    for (uint32_t other_g = 0; static_cast<int64_t>(other_g) < group_id_mapping.length;
+         ++other_g, ++g) {
+      if (!bit_util::GetBit(has_one_.data(), *g)) {
+        if (bit_util::GetBit(other->has_value_.data(), other_g)) {
+          ones_[*g] = std::move(other->ones_[other_g]);
+          bit_util::SetBit(has_value_.mutable_data(), *g);
+        }
+        bit_util::SetBit(has_one_.mutable_data(), *g);
+      }
+    }
+    return Status::OK();
+  }
+
+  Result<Datum> Finalize() override {
+    ARROW_ASSIGN_OR_RAISE(auto null_bitmap, has_value_.Finish());
+    auto ones =
+        ArrayData::Make(out_type(), num_groups_, {std::move(null_bitmap), nullptr});
+    RETURN_NOT_OK(MakeOffsetsValues(ones.get(), ones_));
+    return ones;
+  }
+
+  template <typename T = Type>
+  enable_if_base_binary<T, Status> MakeOffsetsValues(

Review comment:
       I think it can stay in this file. If there's not a clear way to factor it out, then let's not.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] dhruv9vats commented on a change in pull request #12368: ARROW-13993: [C++] [Compute] Add hash_one aggregate function

Posted by GitBox <gi...@apache.org>.
dhruv9vats commented on a change in pull request #12368:
URL: https://github.com/apache/arrow/pull/12368#discussion_r808502128



##########
File path: cpp/src/arrow/compute/kernels/hash_aggregate_test.cc
##########
@@ -2460,6 +2461,294 @@ TEST(GroupBy, Distinct) {
   }
 }
 
+TEST(GroupBy, OneMiscTypes) {
+  auto in_schema = schema({
+      field("floats", float64()),
+      field("nulls", null()),
+      field("booleans", boolean()),
+      field("decimal128", decimal128(3, 2)),
+      field("decimal256", decimal256(3, 2)),
+      field("fixed_binary", fixed_size_binary(3)),
+      field("key", int64()),
+  });
+  for (bool use_exec_plan : {true, false}) {
+    for (bool use_threads : {true, false}) {
+      SCOPED_TRACE(use_threads ? "parallel/merged" : "serial");
+
+      auto table = TableFromJSON(in_schema, {R"([
+    [null, null, true,   null,    null,    null,  1],
+    [1.0,  null, true,   "1.01",  "1.01",  "aaa", 1]
+])",
+                                             R"([
+    [0.0,   null, false, "0.00",  "0.00",  "bac", 2],
+    [null,  null, false, null,    null,    null,  3],
+    [4.0,   null, null,  "4.01",  "4.01",  "234", null],
+    [3.25,  null, true,  "3.25",  "3.25",  "ddd", 1],
+    [0.125, null, false, "0.12",  "0.12",  "bcd", 2]
+])",
+                                             R"([
+    [-0.25, null, false, "-0.25", "-0.25", "bab", 2],
+    [0.75,  null, true,  "0.75",  "0.75",  "123", null],
+    [null,  null, true,  null,    null,    null,  3]
+])"});
+
+      ASSERT_OK_AND_ASSIGN(Datum aggregated_and_grouped,
+                           GroupByTest(
+                               {
+                                   table->GetColumnByName("floats"),
+                                   table->GetColumnByName("nulls"),
+                                   table->GetColumnByName("booleans"),
+                                   table->GetColumnByName("decimal128"),
+                                   table->GetColumnByName("decimal256"),
+                                   table->GetColumnByName("fixed_binary"),
+                               },
+                               {table->GetColumnByName("key")},
+                               {
+                                   {"hash_one", nullptr},
+                                   {"hash_one", nullptr},
+                                   {"hash_one", nullptr},
+                                   {"hash_one", nullptr},
+                                   {"hash_one", nullptr},
+                                   {"hash_one", nullptr},
+                               },
+                               use_threads, use_exec_plan));
+      ValidateOutput(aggregated_and_grouped);
+      SortBy({"key_0"}, &aggregated_and_grouped);
+
+      const auto& struct_arr = aggregated_and_grouped.array_as<StructArray>();
+      //  Check the key column
+      AssertDatumsEqual(ArrayFromJSON(int64(), R"([1, 2, 3, null])"),
+                        struct_arr->field(struct_arr->num_fields() - 1));
+
+      //  Check values individually
+      auto col_0_type = float64();
+      const auto& col_0 = struct_arr->field(0);
+      EXPECT_THAT(col_0->GetScalar(0),
+                  ResultWith(AnyOfJSON(col_0_type, R"([1.0, 3.25])")));
+      EXPECT_THAT(col_0->GetScalar(1),
+                  ResultWith(AnyOfJSON(col_0_type, R"([0.0, 0.125, -0.25])")));
+      EXPECT_THAT(col_0->GetScalar(2), ResultWith(AnyOfJSON(col_0_type, R"([null])")));
+      EXPECT_THAT(col_0->GetScalar(3),
+                  ResultWith(AnyOfJSON(col_0_type, R"([4.0, 0.75])")));
+
+      auto col_1_type = null();
+      const auto& col_1 = struct_arr->field(1);
+      EXPECT_THAT(col_1->GetScalar(0), ResultWith(AnyOfJSON(col_1_type, R"([null])")));
+      EXPECT_THAT(col_1->GetScalar(1), ResultWith(AnyOfJSON(col_1_type, R"([null])")));
+      EXPECT_THAT(col_1->GetScalar(2), ResultWith(AnyOfJSON(col_1_type, R"([null])")));
+      EXPECT_THAT(col_1->GetScalar(3), ResultWith(AnyOfJSON(col_1_type, R"([null])")));
+
+      auto col_2_type = boolean();
+      const auto& col_2 = struct_arr->field(2);
+      EXPECT_THAT(col_2->GetScalar(0), ResultWith(AnyOfJSON(col_2_type, R"([true])")));
+      EXPECT_THAT(col_2->GetScalar(1), ResultWith(AnyOfJSON(col_2_type, R"([false])")));
+      EXPECT_THAT(col_2->GetScalar(2),
+                  ResultWith(AnyOfJSON(col_2_type, R"([true, false])")));
+      EXPECT_THAT(col_2->GetScalar(3),
+                  ResultWith(AnyOfJSON(col_2_type, R"([true, null])")));
+
+      auto col_3_type = decimal128(3, 2);
+      const auto& col_3 = struct_arr->field(3);
+      EXPECT_THAT(col_3->GetScalar(0),
+                  ResultWith(AnyOfJSON(col_3_type, R"(["1.01", "3.25"])")));
+      EXPECT_THAT(col_3->GetScalar(1),
+                  ResultWith(AnyOfJSON(col_3_type, R"(["0.00", "0.12", "-0.25"])")));
+      EXPECT_THAT(col_3->GetScalar(2), ResultWith(AnyOfJSON(col_3_type, R"([null])")));
+      EXPECT_THAT(col_3->GetScalar(3),
+                  ResultWith(AnyOfJSON(col_3_type, R"(["4.01", "0.75"])")));
+
+      auto col_4_type = decimal256(3, 2);
+      const auto& col_4 = struct_arr->field(4);
+      EXPECT_THAT(col_4->GetScalar(0),
+                  ResultWith(AnyOfJSON(col_4_type, R"(["1.01", "3.25"])")));
+      EXPECT_THAT(col_4->GetScalar(1),
+                  ResultWith(AnyOfJSON(col_4_type, R"(["0.00", "0.12", "-0.25"])")));
+      EXPECT_THAT(col_4->GetScalar(2), ResultWith(AnyOfJSON(col_4_type, R"([null])")));
+      EXPECT_THAT(col_4->GetScalar(3),
+                  ResultWith(AnyOfJSON(col_4_type, R"(["4.01", "0.75"])")));
+
+      auto col_5_type = fixed_size_binary(3);
+      const auto& col_5 = struct_arr->field(5);
+      EXPECT_THAT(col_5->GetScalar(0),
+                  ResultWith(AnyOfJSON(col_5_type, R"(["aaa", "ddd"])")));
+      EXPECT_THAT(col_5->GetScalar(1),
+                  ResultWith(AnyOfJSON(col_5_type, R"(["bab", "bcd", "bac"])")));
+      EXPECT_THAT(col_5->GetScalar(2), ResultWith(AnyOfJSON(col_5_type, R"([null])")));
+      EXPECT_THAT(col_5->GetScalar(3),
+                  ResultWith(AnyOfJSON(col_5_type, R"(["123", "234"])")));
+    }
+  }
+}
+
+TEST(GroupBy, OneNumericTypes) {
+  std::vector<std::shared_ptr<DataType>> types;
+  types.insert(types.end(), NumericTypes().begin(), NumericTypes().end());
+  types.insert(types.end(), TemporalTypes().begin(), TemporalTypes().end());
+  types.push_back(month_interval());
+
+  const std::vector<std::string> numeric_table_json = {R"([
+      [null, 1],
+      [1,    1]
+    ])",
+                                                       R"([
+      [0,    2],
+      [null, 3],
+      [3,    4],
+      [5,    4],
+      [4,    null],
+      [3,    1],
+      [0,    2]
+    ])",
+                                                       R"([
+      [0,    2],
+      [1,    null],
+      [null, 3]
+    ])"};
+
+  const std::vector<std::string> temporal_table_json = {R"([
+      [null,      1],
+      [86400000,  1]
+    ])",
+                                                        R"([
+      [0,         2],
+      [null,      3],
+      [259200000, 4],
+      [432000000, 4],
+      [345600000, null],
+      [259200000, 1],
+      [0,         2]
+    ])",
+                                                        R"([
+      [0,         2],
+      [86400000,  null],
+      [null,      3]
+    ])"};
+
+  for (const auto& type : types) {
+    for (bool use_exec_plan : {true, false}) {
+      for (bool use_threads : {true, false}) {
+        SCOPED_TRACE(type->ToString());
+        auto in_schema = schema({field("argument0", type), field("key", int64())});
+        auto table =
+            TableFromJSON(in_schema, (type->name() == "date64") ? temporal_table_json
+                                                                : numeric_table_json);
+        ASSERT_OK_AND_ASSIGN(
+            Datum aggregated_and_grouped,
+            GroupByTest({table->GetColumnByName("argument0")},
+                        {table->GetColumnByName("key")}, {{"hash_one", nullptr}},
+                        use_threads, use_exec_plan));
+        ValidateOutput(aggregated_and_grouped);
+        SortBy({"key_0"}, &aggregated_and_grouped);
+
+        const auto& struct_arr = aggregated_and_grouped.array_as<StructArray>();
+        //  Check the key column
+        AssertDatumsEqual(ArrayFromJSON(int64(), R"([1, 2, 3, 4, null])"),
+                          struct_arr->field(struct_arr->num_fields() - 1));
+
+        //  Check values individually
+        const auto& col = struct_arr->field(0);
+        if (type->name() == "date64") {
+          EXPECT_THAT(col->GetScalar(0),
+                      ResultWith(AnyOfJSON(type, R"([86400000, 259200000])")));
+          EXPECT_THAT(col->GetScalar(1), ResultWith(AnyOfJSON(type, R"([0])")));
+          EXPECT_THAT(col->GetScalar(2), ResultWith(AnyOfJSON(type, R"([null])")));
+          EXPECT_THAT(col->GetScalar(3),
+                      ResultWith(AnyOfJSON(type, R"([259200000, 432000000])")));
+          EXPECT_THAT(col->GetScalar(4),
+                      ResultWith(AnyOfJSON(type, R"([345600000, 86400000])")));
+        } else {
+          EXPECT_THAT(col->GetScalar(0), ResultWith(AnyOfJSON(type, R"([1, 3])")));
+          EXPECT_THAT(col->GetScalar(1), ResultWith(AnyOfJSON(type, R"([0])")));
+          EXPECT_THAT(col->GetScalar(2), ResultWith(AnyOfJSON(type, R"([null])")));
+          EXPECT_THAT(col->GetScalar(3), ResultWith(AnyOfJSON(type, R"([3, 5])")));
+          EXPECT_THAT(col->GetScalar(4), ResultWith(AnyOfJSON(type, R"([4, 1])")));
+        }
+      }
+    }
+  }
+}
+
+TEST(GroupBy, OneBinaryTypes) {
+  for (bool use_exec_plan : {true, false}) {
+    for (bool use_threads : {true, false}) {
+      for (const auto& type : BaseBinaryTypes()) {
+        SCOPED_TRACE(use_threads ? "parallel/merged" : "serial");
+
+        auto table = TableFromJSON(schema({
+                                       field("argument0", type),
+                                       field("key", int64()),
+                                   }),
+                                   {R"([
+    [null,   1],
+    ["aaaa", 1]
+])",
+                                    R"([
+    ["babcd",2],
+    [null,   3],
+    ["2",    null],
+    ["d",    1],
+    ["bc",   2]
+])",
+                                    R"([
+    ["bcd", 2],
+    ["123", null],
+    [null,  3]
+])"});
+
+        ASSERT_OK_AND_ASSIGN(
+            Datum aggregated_and_grouped,
+            GroupByTest({table->GetColumnByName("argument0")},
+                        {table->GetColumnByName("key")}, {{"hash_one", nullptr}},
+                        use_threads, use_exec_plan));
+        ValidateOutput(aggregated_and_grouped);
+        SortBy({"key_0"}, &aggregated_and_grouped);
+
+        const auto& struct_arr = aggregated_and_grouped.array_as<StructArray>();
+        //  Check the key column
+        AssertDatumsEqual(ArrayFromJSON(int64(), R"([1, 2, 3, null])"),
+                          struct_arr->field(struct_arr->num_fields() - 1));
+
+        const auto& col = struct_arr->field(0);
+        EXPECT_THAT(col->GetScalar(0), ResultWith(AnyOfJSON(type, R"(["aaaa", "d"])")));
+        EXPECT_THAT(col->GetScalar(1),
+                    ResultWith(AnyOfJSON(type, R"(["bcd", "bc", "babcd"])")));
+        EXPECT_THAT(col->GetScalar(2), ResultWith(AnyOfJSON(type, R"([null])")));
+        EXPECT_THAT(col->GetScalar(3), ResultWith(AnyOfJSON(type, R"(["2", "123"])")));
+      }
+    }
+  }
+}
+
+TEST(GroupBy, OneScalar) {

Review comment:
       Using `ValueDescr::Scalar()` in all three `ExecBatchFromJSON()` calls produces the same outputs as above, too.
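
       For reference, a sketch of what one of those calls looks like with the argument marked scalar (the values here are hypothetical, not taken from the test):

       ```cpp
       // Sketch: the first column is a scalar broadcast across the batch,
       // the second (the key) is a regular array.
       auto batch = ExecBatchFromJSON({ValueDescr::Scalar(int64()), ValueDescr::Array(int64())},
                                      R"([[1, 1], [1, 2], [1, 3]])");
       ```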




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] lidavidm commented on a change in pull request #12368: ARROW-13993: [C++] [Compute] Add hash_one aggregate function

Posted by GitBox <gi...@apache.org>.
lidavidm commented on a change in pull request #12368:
URL: https://github.com/apache/arrow/pull/12368#discussion_r808489192



##########
File path: cpp/src/arrow/testing/matchers.h
##########
@@ -61,6 +61,65 @@ class PointeesEqualMatcher {
 // Useful in conjunction with other googletest matchers.
 inline PointeesEqualMatcher PointeesEqual() { return {}; }
 
+class AnyOfJSONMatcher {

Review comment:
       (So to be clear: this is just a comment, and I don't think we need to change this.)




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] ursabot commented on pull request #12368: ARROW-13993: [C++] [Compute] Add hash_one aggregate function

Posted by GitBox <gi...@apache.org>.
ursabot commented on pull request #12368:
URL: https://github.com/apache/arrow/pull/12368#issuecomment-1042943111


   Benchmark runs are scheduled for baseline = ed25c616c6270142cce0a2a36c7474e28e167184 and contender = 74f512260fa69903feac61e1287f6954a3d98204. 74f512260fa69903feac61e1287f6954a3d98204 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
   Conbench compare runs links:
   [Scheduled] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/a94a3f0000b84a85a4d19562ff559187...7b33e4d8929944749d78fd64bf5054fa/)
   [Scheduled] [test-mac-arm](https://conbench.ursa.dev/compare/runs/15b82b3edbb4496baefc9bb7f57453c9...34fcb6b914b4436fae5816b099686be4/)
   [Scheduled] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/11c0b23a67e6433c9358d07ced046614...2f00fea949d543acae398e06ffb8787f/)
   [Scheduled] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/31368249ed3448b5b3ec5a0396533d0b...ce98c7d67ad84386b4028c02679b0d02/)
   Supported benchmarks:
   ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python. Runs only benchmarks with cloud = True
   test-mac-arm: Supported benchmark langs: C++, Python, R
   ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
   ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] dhruv9vats commented on a change in pull request #12368: ARROW-13993: [C++] [Compute] Add hash_one aggregate function

Posted by GitBox <gi...@apache.org>.
dhruv9vats commented on a change in pull request #12368:
URL: https://github.com/apache/arrow/pull/12368#discussion_r801678902



##########
File path: cpp/src/arrow/compute/kernels/hash_aggregate.cc
##########
@@ -2451,6 +2451,92 @@ Result<std::unique_ptr<KernelState>> GroupedDistinctInit(KernelContext* ctx,
   return std::move(impl);
 }
 
+// ----------------------------------------------------------------------
+// One implementation
+
+struct GroupedOneImpl : public GroupedAggregator {
+  Status Init(ExecContext* ctx, const std::vector<ValueDescr>&,
+              const FunctionOptions* options) override {
+    ctx_ = ctx;
+    pool_ = ctx->memory_pool();
+    return Status::OK();
+  }
+
+  Status Resize(int64_t new_num_groups) override {
+    num_groups_ = new_num_groups;
+    return Status::OK();
+  }
+
+  Status Consume(const ExecBatch& batch) override {
+    ARROW_ASSIGN_OR_RAISE(std::ignore, grouper_->Consume(batch));
+    return Status::OK();
+  }
+
+  Status Merge(GroupedAggregator&& raw_other,
+               const ArrayData& group_id_mapping) override {
+    auto other = checked_cast<GroupedOneImpl*>(&raw_other);
+
+    // Get (value, group_id) pairs, then translate the group IDs and consume them
+    // ourselves
+    ARROW_ASSIGN_OR_RAISE(auto uniques, other->grouper_->GetUniques());
+    ARROW_ASSIGN_OR_RAISE(auto remapped_g,
+                          AllocateBuffer(uniques.length * sizeof(uint32_t), pool_));
+
+    const auto* g_mapping = group_id_mapping.GetValues<uint32_t>(1);
+    const auto* other_g = uniques[1].array()->GetValues<uint32_t>(1);
+    auto* g = reinterpret_cast<uint32_t*>(remapped_g->mutable_data());
+
+    for (int64_t i = 0; i < uniques.length; i++) {
+      g[i] = g_mapping[other_g[i]];
+    }
+    uniques.values[1] =
+        ArrayData::Make(uint32(), uniques.length, {nullptr, std::move(remapped_g)});
+
+    return Consume(std::move(uniques));
+  }
+
+  Result<Datum> Finalize() override {

Review comment:
       Most of this is carried over from `DistinctCount`/`Distinct`, as I don't fully understand how they work (e.g. `Merge`). I've added a basic test to confirm this is what we want. Also, I've taken a naive `ArrayBuilder` approach to get the basics working. What would some better approaches be?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] dhruv9vats commented on a change in pull request #12368: ARROW-13993: [C++] [Compute] Add hash_one aggregate function

Posted by GitBox <gi...@apache.org>.
dhruv9vats commented on a change in pull request #12368:
URL: https://github.com/apache/arrow/pull/12368#discussion_r805961512



##########
File path: cpp/src/arrow/compute/kernels/hash_aggregate_test.cc
##########
@@ -2460,6 +2461,558 @@ TEST(GroupBy, Distinct) {
   }
 }
 
+MATCHER_P(AnyOfScalar, arrow_array, "") {
+  for (int64_t i = 0; i < arrow_array->length(); ++i) {
+    auto scalar = arrow_array->GetScalar(i).ValueOrDie();
+    if (scalar->Equals(arg)) return true;
+  }
+  *result_listener << "Argument scalar: '" << arg->ToString()
+                   << "' matches no input scalar.";
+  return false;
+}
+
+MATCHER_P(AnyOfScalarFromUniques, unique_list, "") {
+  const auto& flatten = unique_list->Flatten().ValueOrDie();
+  const auto& offsets = std::dynamic_pointer_cast<Int32Array>(unique_list->offsets());
+
+  for (int64_t i = 0; i < arg->length(); ++i) {
+    bool match_found = false;
+    const auto group_hash_one = arg->GetScalar(i).ValueOrDie();
+    int64_t start = offsets->Value(i);
+    int64_t end = offsets->Value(i + 1);
+    for (int64_t j = start; j < end; ++j) {
+      auto s = flatten->GetScalar(j).ValueOrDie();
+      if (s->Equals(group_hash_one)) {
+        match_found = true;
+        break;
+      }
+    }
+    if (!match_found) {
+      *result_listener << "Argument scalar: '" << group_hash_one->ToString()
+                       << "' matches no input scalar.";
+      return false;
+    }
+  }
+  return true;
+}
+
+TEST(GroupBy, One) {
+  {
+    auto table =
+        TableFromJSON(schema({field("argument", int64()), field("key", int64())}), {R"([
+    [99,  1],
+    [99,  1]
+])",
+                                                                                    R"([
+    [77,  2],
+    [null,   3],
+    [null,   3]
+])",
+                                                                                    R"([
+    [null,   4],
+    [null,   4]
+])",
+                                                                                  R"([
+    [88,  null],
+    [99,  3]
+])",
+                                                                                  R"([
+    [77,  2],
+    [76, 2]
+])",
+                                                                                  R"([
+    [75, null],
+    [74,  3]
+  ])",
+                                                                                  R"([
+    [73,    null],
+    [72,    null]
+  ])"});
+
+  ASSERT_OK_AND_ASSIGN(auto aggregated_and_grouped,
+                       internal::GroupBy(
+                           {
+                               table->GetColumnByName("argument"),
+                           },
+                           {
+                               table->GetColumnByName("key"),
+                           },
+                           {
+                               {"hash_one", nullptr},
+                           },
+                           false));
+  ValidateOutput(aggregated_and_grouped);
+  SortBy({"key_0"}, &aggregated_and_grouped);
+
+  AssertDatumsEqual(ArrayFromJSON(struct_({
+                                      field("hash_one", int64()),
+                                      field("key_0", int64()),
+                                  }),
+                                  R"([
+      [99, 1],
+      [77, 2],
+      [null,  3],
+      [null,  4],
+      [88, null]
+    ])"),
+                    aggregated_and_grouped,
+                    /*verbose=*/true);
+  }
+  {
+    auto table =
+        TableFromJSON(schema({field("argument", utf8()), field("key", int64())}), {R"([
+     ["foo",  1],
+     ["foo",  1]
+ ])",
+                                                                                   R"([
+     ["bar",  2],
+     [null,   3],
+     [null,   3]
+ ])",
+                                                                                   R"([
+     [null,   4],
+     [null,   4]
+ ])",
+                                                                                   R"([
+     ["baz",  null],
+     ["foo",  3]
+ ])",
+                                                                                   R"([
+     ["bar",  2],
+     ["spam", 2]
+ ])",
+                                                                                   R"([
+     ["eggs", null],
+     ["ham",  3]
+   ])",
+                                                                                   R"([
+     ["a",    null],
+     ["b",    null]
+   ])"});
+
+    ASSERT_OK_AND_ASSIGN(auto aggregated_and_grouped,
+                         internal::GroupBy(
+                             {
+                                 table->GetColumnByName("argument"),
+                             },
+                             {
+                                 table->GetColumnByName("key"),
+                             },
+                             {
+                                 {"hash_one", nullptr},
+                             },
+                             false));
+    ValidateOutput(aggregated_and_grouped);
+    SortBy({"key_0"}, &aggregated_and_grouped);
+
+    AssertDatumsEqual(ArrayFromJSON(struct_({
+                                        field("hash_one", utf8()),
+                                        field("key_0", int64()),
+                                    }),
+                                    R"([
+       ["foo", 1],
+       ["bar", 2],
+       [null,  3],
+       [null,  4],
+       ["baz", null]
+     ])"),
+                      aggregated_and_grouped,
+                      /*verbose=*/true);
+  }
+}
+
+TEST(GroupBy, OneOnly) {
+  auto in_schema = schema({
+      field("argument0", float64()),
+      field("argument1", null()),
+      field("argument2", boolean()),
+      field("key", int64()),
+  });
+  for (bool use_exec_plan : {false, true}) {
+    for (bool use_threads : {false, true}) {
+      SCOPED_TRACE(use_threads ? "parallel/merged" : "serial");
+
+      auto table = TableFromJSON(in_schema, {R"([
+    [1.0,   null, true, 1],
+    [null,  null, true, 1]
+])",
+                                             R"([
+    [0.0,   null, false, 2],
+    [null,  null, false, 3],
+    [4.0,   null, null,  null],
+    [3.25,  null, true,  1],
+    [0.125, null, false, 2]
+])",
+                                             R"([
+    [-0.25, null, false, 2],
+    [0.75,  null, true,  null],
+    [null,  null, true,  3]
+])"});
+
+      ASSERT_OK_AND_ASSIGN(Datum aggregated_and_grouped,
+                           GroupByTest(
+                               {
+                                   table->GetColumnByName("argument0"),
+                                   table->GetColumnByName("argument1"),
+                                   table->GetColumnByName("argument2"),
+                               },
+                               {table->GetColumnByName("key")},
+                               {
+                                   {"hash_one", nullptr},
+                                   {"hash_one", nullptr},
+                                   {"hash_one", nullptr},
+                               },
+                               use_threads, use_exec_plan));
+      ValidateOutput(aggregated_and_grouped);
+      SortBy({"key_0"}, &aggregated_and_grouped);
+
+      //      AssertDatumsEqual(ArrayFromJSON(struct_({
+      //                                          field("hash_one", float64()),
+      //                                          field("hash_one", null()),
+      //                                          field("hash_one", boolean()),
+      //                                          field("key_0", int64()),
+      //                                      }),
+      //                                      R"([
+      //          [1.0,  null, true,  1],
+      //          [0.0,  null, false, 2],
+      //          [null, null, false, 3],
+      //          [4.0,  null, null,  null]
+      //        ])"),
+      //                        aggregated_and_grouped,
+      //                        /*verbose=*/true);
+
+      const auto& struct_arr = aggregated_and_grouped.array_as<StructArray>();
+      //  Check the key column
+      AssertDatumsEqual(ArrayFromJSON(int64(), "[1, 2, 3, null]"), struct_arr->field(3));
+
+      auto type_col_0 = float64();
+      auto group_one_col_0 =
+          AnyOfScalar(ArrayFromJSON(type_col_0, R"([1.0, null, 3.25])"));
+      auto group_two_col_0 =
+          AnyOfScalar(ArrayFromJSON(type_col_0, R"([0.0, 0.125, -0.25])"));
+      auto group_three_col_0 = AnyOfScalar(ArrayFromJSON(type_col_0, R"([null])"));
+      auto group_null_col_0 = AnyOfScalar(ArrayFromJSON(type_col_0, R"([4.0, 0.75])"));
+
+      //  Check values individually
+      const auto& col0 = struct_arr->field(0);
+      ASSERT_OK_AND_ASSIGN(const auto g_one, col0->GetScalar(0));
+      EXPECT_THAT(g_one, group_one_col_0);
+      ASSERT_OK_AND_ASSIGN(const auto g_two, col0->GetScalar(1));
+      EXPECT_THAT(g_two, group_two_col_0);
+      ASSERT_OK_AND_ASSIGN(const auto g_three, col0->GetScalar(2));
+      EXPECT_THAT(g_three, group_three_col_0);
+      ASSERT_OK_AND_ASSIGN(const auto g_null, col0->GetScalar(3));
+      EXPECT_THAT(g_null, group_null_col_0);

Review comment:
       This is what the `EXPECT_THAT` and `AnyOf` approach might look like, for testing _just one_ column.

##########
File path: cpp/src/arrow/compute/kernels/hash_aggregate_test.cc
##########
@@ -2460,6 +2461,558 @@ TEST(GroupBy, Distinct) {
   }
 }
 
+MATCHER_P(AnyOfScalar, arrow_array, "") {
+  for (int64_t i = 0; i < arrow_array->length(); ++i) {
+    auto scalar = arrow_array->GetScalar(i).ValueOrDie();
+    if (scalar->Equals(arg)) return true;
+  }
+  *result_listener << "Argument scalar: '" << arg->ToString()
+                   << "' matches no input scalar.";
+  return false;
+}
+
+MATCHER_P(AnyOfScalarFromUniques, unique_list, "") {
+  const auto& flatten = unique_list->Flatten().ValueOrDie();
+  const auto& offsets = std::dynamic_pointer_cast<Int32Array>(unique_list->offsets());
+
+  for (int64_t i = 0; i < arg->length(); ++i) {
+    bool match_found = false;
+    const auto group_hash_one = arg->GetScalar(i).ValueOrDie();
+    int64_t start = offsets->Value(i);
+    int64_t end = offsets->Value(i + 1);
+    for (int64_t j = start; j < end; ++j) {
+      auto s = flatten->GetScalar(j).ValueOrDie();
+      if (s->Equals(group_hash_one)) {
+        match_found = true;
+        break;
+      }
+    }
+    if (!match_found) {
+      *result_listener << "Argument scalar: '" << group_hash_one->ToString()
+                       << "' matches no input scalar.";
+      return false;
+    }
+  }
+  return true;
+}
+
+TEST(GroupBy, One) {
+  {
+    auto table =
+        TableFromJSON(schema({field("argument", int64()), field("key", int64())}), {R"([
+    [99,  1],
+    [99,  1]
+])",
+                                                                                    R"([
+    [77,  2],
+    [null,   3],
+    [null,   3]
+])",
+                                                                                    R"([
+    [null,   4],
+    [null,   4]
+])",
+                                                                                  R"([
+    [88,  null],
+    [99,  3]
+])",
+                                                                                  R"([
+    [77,  2],
+    [76, 2]
+])",
+                                                                                  R"([
+    [75, null],
+    [74,  3]
+  ])",
+                                                                                  R"([
+    [73,    null],
+    [72,    null]
+  ])"});
+
+  ASSERT_OK_AND_ASSIGN(auto aggregated_and_grouped,
+                       internal::GroupBy(
+                           {
+                               table->GetColumnByName("argument"),
+                           },
+                           {
+                               table->GetColumnByName("key"),
+                           },
+                           {
+                               {"hash_one", nullptr},
+                           },
+                           false));
+  ValidateOutput(aggregated_and_grouped);
+  SortBy({"key_0"}, &aggregated_and_grouped);
+
+  AssertDatumsEqual(ArrayFromJSON(struct_({
+                                      field("hash_one", int64()),
+                                      field("key_0", int64()),
+                                  }),
+                                  R"([
+      [99, 1],
+      [77, 2],
+      [null,  3],
+      [null,  4],
+      [88, null]
+    ])"),
+                    aggregated_and_grouped,
+                    /*verbose=*/true);
+  }
+  {
+    auto table =
+        TableFromJSON(schema({field("argument", utf8()), field("key", int64())}), {R"([
+     ["foo",  1],
+     ["foo",  1]
+ ])",
+                                                                                   R"([
+     ["bar",  2],
+     [null,   3],
+     [null,   3]
+ ])",
+                                                                                   R"([
+     [null,   4],
+     [null,   4]
+ ])",
+                                                                                   R"([
+     ["baz",  null],
+     ["foo",  3]
+ ])",
+                                                                                   R"([
+     ["bar",  2],
+     ["spam", 2]
+ ])",
+                                                                                   R"([
+     ["eggs", null],
+     ["ham",  3]
+   ])",
+                                                                                   R"([
+     ["a",    null],
+     ["b",    null]
+   ])"});
+
+    ASSERT_OK_AND_ASSIGN(auto aggregated_and_grouped,
+                         internal::GroupBy(
+                             {
+                                 table->GetColumnByName("argument"),
+                             },
+                             {
+                                 table->GetColumnByName("key"),
+                             },
+                             {
+                                 {"hash_one", nullptr},
+                             },
+                             false));
+    ValidateOutput(aggregated_and_grouped);
+    SortBy({"key_0"}, &aggregated_and_grouped);
+
+    AssertDatumsEqual(ArrayFromJSON(struct_({
+                                        field("hash_one", utf8()),
+                                        field("key_0", int64()),
+                                    }),
+                                    R"([
+       ["foo", 1],
+       ["bar", 2],
+       [null,  3],
+       [null,  4],
+       ["baz", null]
+     ])"),
+                      aggregated_and_grouped,
+                      /*verbose=*/true);
+  }
+}
+
+TEST(GroupBy, OneOnly) {
+  auto in_schema = schema({
+      field("argument0", float64()),
+      field("argument1", null()),
+      field("argument2", boolean()),
+      field("key", int64()),
+  });
+  for (bool use_exec_plan : {false, true}) {
+    for (bool use_threads : {false, true}) {
+      SCOPED_TRACE(use_threads ? "parallel/merged" : "serial");
+
+      auto table = TableFromJSON(in_schema, {R"([
+    [1.0,   null, true, 1],
+    [null,  null, true, 1]
+])",
+                                             R"([
+    [0.0,   null, false, 2],
+    [null,  null, false, 3],
+    [4.0,   null, null,  null],
+    [3.25,  null, true,  1],
+    [0.125, null, false, 2]
+])",
+                                             R"([
+    [-0.25, null, false, 2],
+    [0.75,  null, true,  null],
+    [null,  null, true,  3]
+])"});
+
+      ASSERT_OK_AND_ASSIGN(Datum aggregated_and_grouped,
+                           GroupByTest(
+                               {
+                                   table->GetColumnByName("argument0"),
+                                   table->GetColumnByName("argument1"),
+                                   table->GetColumnByName("argument2"),
+                               },
+                               {table->GetColumnByName("key")},
+                               {
+                                   {"hash_one", nullptr},
+                                   {"hash_one", nullptr},
+                                   {"hash_one", nullptr},
+                               },
+                               use_threads, use_exec_plan));
+      ValidateOutput(aggregated_and_grouped);
+      SortBy({"key_0"}, &aggregated_and_grouped);
+
+      //      AssertDatumsEqual(ArrayFromJSON(struct_({
+      //                                          field("hash_one", float64()),
+      //                                          field("hash_one", null()),
+      //                                          field("hash_one", boolean()),
+      //                                          field("key_0", int64()),
+      //                                      }),
+      //                                      R"([
+      //          [1.0,  null, true,  1],
+      //          [0.0,  null, false, 2],
+      //          [null, null, false, 3],
+      //          [4.0,  null, null,  null]
+      //        ])"),
+      //                        aggregated_and_grouped,
+      //                        /*verbose=*/true);
+
+      const auto& struct_arr = aggregated_and_grouped.array_as<StructArray>();
+      //  Check the key column
+      AssertDatumsEqual(ArrayFromJSON(int64(), "[1, 2, 3, null]"), struct_arr->field(3));
+
+      auto type_col_0 = float64();
+      auto group_one_col_0 =
+          AnyOfScalar(ArrayFromJSON(type_col_0, R"([1.0, null, 3.25])"));
+      auto group_two_col_0 =
+          AnyOfScalar(ArrayFromJSON(type_col_0, R"([0.0, 0.125, -0.25])"));
+      auto group_three_col_0 = AnyOfScalar(ArrayFromJSON(type_col_0, R"([null])"));
+      auto group_null_col_0 = AnyOfScalar(ArrayFromJSON(type_col_0, R"([4.0, 0.75])"));
+
+      //  Check values individually
+      const auto& col0 = struct_arr->field(0);
+      ASSERT_OK_AND_ASSIGN(const auto g_one, col0->GetScalar(0));
+      EXPECT_THAT(g_one, group_one_col_0);
+      ASSERT_OK_AND_ASSIGN(const auto g_two, col0->GetScalar(1));
+      EXPECT_THAT(g_two, group_two_col_0);
+      ASSERT_OK_AND_ASSIGN(const auto g_three, col0->GetScalar(2));
+      EXPECT_THAT(g_three, group_three_col_0);
+      ASSERT_OK_AND_ASSIGN(const auto g_null, col0->GetScalar(3));
+      EXPECT_THAT(g_null, group_null_col_0);
+
+      CountOptions all(CountOptions::ALL);
+      ASSERT_OK_AND_ASSIGN(
+          auto distinct_out,
+          internal::GroupBy(
+              {
+                  table->GetColumnByName("argument0"),
+                  table->GetColumnByName("argument1"),
+                  table->GetColumnByName("argument2"),
+              },
+              {
+                  table->GetColumnByName("key"),
+              },
+              {{"hash_distinct", &all}, {"hash_distinct", &all}, {"hash_distinct", &all}},
+              use_threads));
+      ValidateOutput(distinct_out);
+      SortBy({"key_0"}, &distinct_out);
+
+      const auto& struct_arr_distinct = distinct_out.array_as<StructArray>();
+      for (int64_t col = 0; col < struct_arr_distinct->length() - 1; ++col) {
+        const auto matcher = AnyOfScalarFromUniques(
+            checked_pointer_cast<ListArray>(struct_arr_distinct->field(col)));
+        EXPECT_THAT(struct_arr->field(col), matcher);
+      }

Review comment:
      While this tests _all_ the columns (the key column is not a `ListArray`, so it will have to be tested manually, but that's trivial). So if using other kernels to write tests is not strictly discouraged, this is a rather clean way of doing it. @lidavidm 
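
      For reference, the manual key check is the same one-liner pattern used earlier in this test (a sketch; it assumes `distinct_out` has been sorted by `key_0`, as above):

   ```
   //  The key column of the hash_distinct output is a plain Int64Array, not a
   //  ListArray, so it can be compared directly.
   AssertDatumsEqual(
       ArrayFromJSON(int64(), "[1, 2, 3, null]"),
       struct_arr_distinct->field(struct_arr_distinct->num_fields() - 1));
   ```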

##########
File path: cpp/src/arrow/compute/kernels/hash_aggregate_test.cc
##########
@@ -2460,6 +2461,558 @@ TEST(GroupBy, Distinct) {
   }
 }
 
+MATCHER_P(AnyOfScalar, arrow_array, "") {
+  for (int64_t i = 0; i < arrow_array->length(); ++i) {
+    auto scalar = arrow_array->GetScalar(i).ValueOrDie();
+    if (scalar->Equals(arg)) return true;
+  }
+  *result_listener << "Argument scalar: '" << arg->ToString()
+                   << "' matches no input scalar.";
+  return false;
+}
+
+MATCHER_P(AnyOfScalarFromUniques, unique_list, "") {
+  const auto& flatten = unique_list->Flatten().ValueOrDie();
+  const auto& offsets = std::dynamic_pointer_cast<Int32Array>(unique_list->offsets());
+
+  for (int64_t i = 0; i < arg->length(); ++i) {
+    bool match_found = false;
+    const auto group_hash_one = arg->GetScalar(i).ValueOrDie();
+    int64_t start = offsets->Value(i);
+    int64_t end = offsets->Value(i + 1);
+    for (int64_t j = start; j < end; ++j) {
+      auto s = flatten->GetScalar(j).ValueOrDie();
+      if (s->Equals(group_hash_one)) {
+        match_found = true;
+        break;
+      }
+    }
+    if (!match_found) {
+      *result_listener << "Argument scalar: '" << group_hash_one->ToString()
+                       << "' matches no input scalar.";
+      return false;
+    }
+  }
+  return true;
+}
+
+TEST(GroupBy, One) {
+  {
+    auto table =
+        TableFromJSON(schema({field("argument", int64()), field("key", int64())}), {R"([
+    [99,  1],
+    [99,  1]
+])",
+                                                                                    R"([
+    [77,  2],
+    [null,   3],
+    [null,   3]
+])",
+                                                                                    R"([
+    [null,   4],
+    [null,   4]
+])",
+                                                                                  R"([
+    [88,  null],
+    [99,  3]
+])",
+                                                                                  R"([
+    [77,  2],
+    [76, 2]
+])",
+                                                                                  R"([
+    [75, null],
+    [74,  3]
+  ])",
+                                                                                  R"([
+    [73,    null],
+    [72,    null]
+  ])"});
+
+  ASSERT_OK_AND_ASSIGN(auto aggregated_and_grouped,
+                       internal::GroupBy(
+                           {
+                               table->GetColumnByName("argument"),
+                           },
+                           {
+                               table->GetColumnByName("key"),
+                           },
+                           {
+                               {"hash_one", nullptr},
+                           },
+                           false));
+  ValidateOutput(aggregated_and_grouped);
+  SortBy({"key_0"}, &aggregated_and_grouped);
+
+  AssertDatumsEqual(ArrayFromJSON(struct_({
+                                      field("hash_one", int64()),
+                                      field("key_0", int64()),
+                                  }),
+                                  R"([
+      [99, 1],
+      [77, 2],
+      [null,  3],
+      [null,  4],
+      [88, null]
+    ])"),
+                    aggregated_and_grouped,
+                    /*verbose=*/true);
+  }
+  {
+    auto table =
+        TableFromJSON(schema({field("argument", utf8()), field("key", int64())}), {R"([
+     ["foo",  1],
+     ["foo",  1]
+ ])",
+                                                                                   R"([
+     ["bar",  2],
+     [null,   3],
+     [null,   3]
+ ])",
+                                                                                   R"([
+     [null,   4],
+     [null,   4]
+ ])",
+                                                                                   R"([
+     ["baz",  null],
+     ["foo",  3]
+ ])",
+                                                                                   R"([
+     ["bar",  2],
+     ["spam", 2]
+ ])",
+                                                                                   R"([
+     ["eggs", null],
+     ["ham",  3]
+   ])",
+                                                                                   R"([
+     ["a",    null],
+     ["b",    null]
+   ])"});
+
+    ASSERT_OK_AND_ASSIGN(auto aggregated_and_grouped,
+                         internal::GroupBy(
+                             {
+                                 table->GetColumnByName("argument"),
+                             },
+                             {
+                                 table->GetColumnByName("key"),
+                             },
+                             {
+                                 {"hash_one", nullptr},
+                             },
+                             false));
+    ValidateOutput(aggregated_and_grouped);
+    SortBy({"key_0"}, &aggregated_and_grouped);
+
+    AssertDatumsEqual(ArrayFromJSON(struct_({
+                                        field("hash_one", utf8()),
+                                        field("key_0", int64()),
+                                    }),
+                                    R"([
+       ["foo", 1],
+       ["bar", 2],
+       [null,  3],
+       [null,  4],
+       ["baz", null]
+     ])"),
+                      aggregated_and_grouped,
+                      /*verbose=*/true);
+  }
+}
+
+TEST(GroupBy, OneOnly) {
+  auto in_schema = schema({
+      field("argument0", float64()),
+      field("argument1", null()),
+      field("argument2", boolean()),
+      field("key", int64()),
+  });
+  for (bool use_exec_plan : {false, true}) {
+    for (bool use_threads : {false, true}) {
+      SCOPED_TRACE(use_threads ? "parallel/merged" : "serial");
+
+      auto table = TableFromJSON(in_schema, {R"([
+    [1.0,   null, true, 1],
+    [null,  null, true, 1]
+])",
+                                             R"([
+    [0.0,   null, false, 2],
+    [null,  null, false, 3],
+    [4.0,   null, null,  null],
+    [3.25,  null, true,  1],
+    [0.125, null, false, 2]
+])",
+                                             R"([
+    [-0.25, null, false, 2],
+    [0.75,  null, true,  null],
+    [null,  null, true,  3]
+])"});
+
+      ASSERT_OK_AND_ASSIGN(Datum aggregated_and_grouped,
+                           GroupByTest(
+                               {
+                                   table->GetColumnByName("argument0"),
+                                   table->GetColumnByName("argument1"),
+                                   table->GetColumnByName("argument2"),
+                               },
+                               {table->GetColumnByName("key")},
+                               {
+                                   {"hash_one", nullptr},
+                                   {"hash_one", nullptr},
+                                   {"hash_one", nullptr},
+                               },
+                               use_threads, use_exec_plan));
+      ValidateOutput(aggregated_and_grouped);
+      SortBy({"key_0"}, &aggregated_and_grouped);
+
+      //      AssertDatumsEqual(ArrayFromJSON(struct_({
+      //                                          field("hash_one", float64()),
+      //                                          field("hash_one", null()),
+      //                                          field("hash_one", boolean()),
+      //                                          field("key_0", int64()),
+      //                                      }),
+      //                                      R"([
+      //          [1.0,  null, true,  1],
+      //          [0.0,  null, false, 2],
+      //          [null, null, false, 3],
+      //          [4.0,  null, null,  null]
+      //        ])"),
+      //                        aggregated_and_grouped,
+      //                        /*verbose=*/true);
+
+      const auto& struct_arr = aggregated_and_grouped.array_as<StructArray>();
+      //  Check the key column
+      AssertDatumsEqual(ArrayFromJSON(int64(), "[1, 2, 3, null]"), struct_arr->field(3));
+
+      auto type_col_0 = float64();
+      auto group_one_col_0 =
+          AnyOfScalar(ArrayFromJSON(type_col_0, R"([1.0, null, 3.25])"));
+      auto group_two_col_0 =
+          AnyOfScalar(ArrayFromJSON(type_col_0, R"([0.0, 0.125, -0.25])"));
+      auto group_three_col_0 = AnyOfScalar(ArrayFromJSON(type_col_0, R"([null])"));
+      auto group_null_col_0 = AnyOfScalar(ArrayFromJSON(type_col_0, R"([4.0, 0.75])"));
+
+      //  Check values individually
+      const auto& col0 = struct_arr->field(0);
+      ASSERT_OK_AND_ASSIGN(const auto g_one, col0->GetScalar(0));
+      EXPECT_THAT(g_one, group_one_col_0);
+      ASSERT_OK_AND_ASSIGN(const auto g_two, col0->GetScalar(1));
+      EXPECT_THAT(g_two, group_two_col_0);
+      ASSERT_OK_AND_ASSIGN(const auto g_three, col0->GetScalar(2));
+      EXPECT_THAT(g_three, group_three_col_0);
+      ASSERT_OK_AND_ASSIGN(const auto g_null, col0->GetScalar(3));
+      EXPECT_THAT(g_null, group_null_col_0);
+
+      CountOptions all(CountOptions::ALL);
+      ASSERT_OK_AND_ASSIGN(
+          auto distinct_out,
+          internal::GroupBy(
+              {
+                  table->GetColumnByName("argument0"),
+                  table->GetColumnByName("argument1"),
+                  table->GetColumnByName("argument2"),
+              },
+              {
+                  table->GetColumnByName("key"),
+              },
+              {{"hash_distinct", &all}, {"hash_distinct", &all}, {"hash_distinct", &all}},
+              use_threads));
+      ValidateOutput(distinct_out);
+      SortBy({"key_0"}, &distinct_out);
+
+      const auto& struct_arr_distinct = distinct_out.array_as<StructArray>();
+      for (int64_t col = 0; col < struct_arr_distinct->length() - 1; ++col) {

Review comment:
   ```suggestion
         for (int64_t col = 0; col < struct_arr_distinct->num_fields() - 1; ++col) {
   ```
   (The loop iterates over columns, and `length()` only happens to equal `num_fields()` here, so it should be bounded by the field count.)




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] dhruv9vats commented on a change in pull request #12368: ARROW-13993: [C++] [Compute] Add hash_one aggregate function

Posted by GitBox <gi...@apache.org>.
dhruv9vats commented on a change in pull request #12368:
URL: https://github.com/apache/arrow/pull/12368#discussion_r807884279



##########
File path: cpp/src/arrow/compute/kernels/hash_aggregate.cc
##########
@@ -2451,6 +2451,333 @@ Result<std::unique_ptr<KernelState>> GroupedDistinctInit(KernelContext* ctx,
   return std::move(impl);
 }
 
+// ----------------------------------------------------------------------
+// One implementation
+
+template <typename Type, typename Enable = void>
+struct GroupedOneImpl final : public GroupedAggregator {
+  using CType = typename TypeTraits<Type>::CType;
+  using GetSet = GroupedValueTraits<Type>;
+
+  Status Init(ExecContext* ctx, const std::vector<ValueDescr>&,
+              const FunctionOptions* options) override {
+    // out_type_ initialized by GroupedOneInit
+    ones_ = TypedBufferBuilder<CType>(ctx->memory_pool());
+    has_one_ = TypedBufferBuilder<bool>(ctx->memory_pool());
+    has_value_ = TypedBufferBuilder<bool>(ctx->memory_pool());
+    return Status::OK();
+  }
+
+  Status Resize(int64_t new_num_groups) override {
+    auto added_groups = new_num_groups - num_groups_;
+    num_groups_ = new_num_groups;
+    RETURN_NOT_OK(ones_.Append(added_groups, static_cast<CType>(0)));
+    RETURN_NOT_OK(has_one_.Append(added_groups, false));
+    RETURN_NOT_OK(has_value_.Append(added_groups, false));
+    return Status::OK();
+  }
+
+  Status Consume(const ExecBatch& batch) override {
+    auto raw_ones_ = ones_.mutable_data();
+
+    return VisitGroupedValues<Type>(
+        batch,
+        [&](uint32_t g, CType val) -> Status {
+          if (!bit_util::GetBit(has_one_.data(), g)) {
+            GetSet::Set(raw_ones_, g, val);
+            bit_util::SetBit(has_one_.mutable_data(), g);
+            bit_util::SetBit(has_value_.mutable_data(), g);
+          }
+          return Status::OK();
+        },
+        [&](uint32_t g) -> Status {
+          bit_util::SetBit(has_one_.mutable_data(), g);
+          return Status::OK();
+        });
+  }
+
+  Status Merge(GroupedAggregator&& raw_other,
+               const ArrayData& group_id_mapping) override {
+    auto other = checked_cast<GroupedOneImpl*>(&raw_other);
+
+    auto raw_ones = ones_.mutable_data();
+    auto other_raw_ones = other->ones_.mutable_data();
+
+    auto g = group_id_mapping.GetValues<uint32_t>(1);
+    for (uint32_t other_g = 0; static_cast<int64_t>(other_g) < group_id_mapping.length;
+         ++other_g, ++g) {
+      if (!bit_util::GetBit(has_one_.data(), *g)) {
+        if (bit_util::GetBit(other->has_value_.data(), other_g)) {
+          GetSet::Set(raw_ones, *g, GetSet::Get(other_raw_ones, other_g));
+          bit_util::SetBit(has_value_.mutable_data(), *g);
+        }
+        bit_util::SetBit(has_one_.mutable_data(), *g);
+      }
+    }
+
+    return Status::OK();
+  }
+
+  Result<Datum> Finalize() override {
+    ARROW_ASSIGN_OR_RAISE(auto null_bitmap, has_value_.Finish());
+    ARROW_ASSIGN_OR_RAISE(auto data, ones_.Finish());
+    return ArrayData::Make(out_type_, num_groups_,
+                           {std::move(null_bitmap), std::move(data)});
+  }
+
+  std::shared_ptr<DataType> out_type() const override { return out_type_; }
+
+  int64_t num_groups_;
+  TypedBufferBuilder<CType> ones_;
+  TypedBufferBuilder<bool> has_one_, has_value_;
+  std::shared_ptr<DataType> out_type_;
+};
+
+struct GroupedNullOneImpl : public GroupedAggregator {
+  Status Init(ExecContext* ctx, const std::vector<ValueDescr>&,
+              const FunctionOptions* options) override {
+    return Status::OK();
+  }
+
+  Status Resize(int64_t new_num_groups) override {
+    num_groups_ = new_num_groups;
+    return Status::OK();
+  }
+
+  Status Consume(const ExecBatch& batch) override { return Status::OK(); }
+
+  Status Merge(GroupedAggregator&& raw_other,
+               const ArrayData& group_id_mapping) override {
+    return Status::OK();
+  }
+
+  Result<Datum> Finalize() override {
+    return ArrayData::Make(null(), num_groups_, {nullptr}, num_groups_);
+  }
+
+  std::shared_ptr<DataType> out_type() const override { return null(); }
+
+  int64_t num_groups_;
+};
+
+template <typename Type>
+struct GroupedOneImpl<Type, enable_if_t<is_base_binary_type<Type>::value ||
+                                        std::is_same<Type, FixedSizeBinaryType>::value>>
+    final : public GroupedAggregator {
+  using Allocator = arrow::stl::allocator<char>;
+  using StringType = std::basic_string<char, std::char_traits<char>, Allocator>;
+
+  Status Init(ExecContext* ctx, const std::vector<ValueDescr>&,
+              const FunctionOptions* options) override {
+    ctx_ = ctx;
+    allocator_ = Allocator(ctx->memory_pool());
+    // out_type_ initialized by GroupedOneInit
+    has_value_ = TypedBufferBuilder<bool>(ctx->memory_pool());
+    return Status::OK();
+  }
+
+  Status Resize(int64_t new_num_groups) override {
+    auto added_groups = new_num_groups - num_groups_;
+    DCHECK_GE(added_groups, 0);
+    num_groups_ = new_num_groups;
+    ones_.resize(new_num_groups);
+    RETURN_NOT_OK(has_one_.Append(added_groups, false));
+    RETURN_NOT_OK(has_value_.Append(added_groups, false));
+    return Status::OK();
+  }
+
+  Status Consume(const ExecBatch& batch) override {
+    return VisitGroupedValues<Type>(
+        batch,
+        [&](uint32_t g, util::string_view val) -> Status {
+          if (!bit_util::GetBit(has_one_.data(), g)) {
+            ones_[g].emplace(val.data(), val.size(), allocator_);
+            bit_util::SetBit(has_one_.mutable_data(), g);
+            bit_util::SetBit(has_value_.mutable_data(), g);
+          }
+          return Status::OK();
+        },
+        [&](uint32_t g) -> Status {
+          // as has_one_ is set, has_value_ will never be set, resulting in null
+          bit_util::SetBit(has_one_.mutable_data(), g);
+          return Status::OK();
+        });
+  }
+
+  Status Merge(GroupedAggregator&& raw_other,
+               const ArrayData& group_id_mapping) override {
+    auto other = checked_cast<GroupedOneImpl*>(&raw_other);
+    auto g = group_id_mapping.GetValues<uint32_t>(1);
+    for (uint32_t other_g = 0; static_cast<int64_t>(other_g) < group_id_mapping.length;
+         ++other_g, ++g) {
+      if (!bit_util::GetBit(has_one_.data(), *g)) {
+        if (bit_util::GetBit(other->has_value_.data(), other_g)) {
+          ones_[*g] = std::move(other->ones_[other_g]);
+          bit_util::SetBit(has_value_.mutable_data(), *g);
+        }
+        bit_util::SetBit(has_one_.mutable_data(), *g);
+      }
+    }
+    return Status::OK();
+  }
+
+  Result<Datum> Finalize() override {
+    ARROW_ASSIGN_OR_RAISE(auto null_bitmap, has_value_.Finish());
+    auto ones =
+        ArrayData::Make(out_type(), num_groups_, {std::move(null_bitmap), nullptr});
+    RETURN_NOT_OK(MakeOffsetsValues(ones.get(), ones_));
+    return ones;
+  }
+
+  template <typename T = Type>
+  enable_if_base_binary<T, Status> MakeOffsetsValues(

Review comment:
       What would this look like?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] lidavidm commented on a change in pull request #12368: ARROW-13993: [C++] [Compute] Add hash_one aggregate function

Posted by GitBox <gi...@apache.org>.
lidavidm commented on a change in pull request #12368:
URL: https://github.com/apache/arrow/pull/12368#discussion_r808501334



##########
File path: cpp/src/arrow/compute/kernels/hash_aggregate_test.cc
##########
@@ -2460,6 +2461,294 @@ TEST(GroupBy, Distinct) {
   }
 }
 
+TEST(GroupBy, OneMiscTypes) {
+  auto in_schema = schema({
+      field("floats", float64()),
+      field("nulls", null()),
+      field("booleans", boolean()),
+      field("decimal128", decimal128(3, 2)),
+      field("decimal256", decimal256(3, 2)),
+      field("fixed_binary", fixed_size_binary(3)),
+      field("key", int64()),
+  });
+  for (bool use_exec_plan : {true, false}) {
+    for (bool use_threads : {true, false}) {
+      SCOPED_TRACE(use_threads ? "parallel/merged" : "serial");
+
+      auto table = TableFromJSON(in_schema, {R"([
+    [null, null, true,   null,    null,    null,  1],
+    [1.0,  null, true,   "1.01",  "1.01",  "aaa", 1]
+])",
+                                             R"([
+    [0.0,   null, false, "0.00",  "0.00",  "bac", 2],
+    [null,  null, false, null,    null,    null,  3],
+    [4.0,   null, null,  "4.01",  "4.01",  "234", null],
+    [3.25,  null, true,  "3.25",  "3.25",  "ddd", 1],
+    [0.125, null, false, "0.12",  "0.12",  "bcd", 2]
+])",
+                                             R"([
+    [-0.25, null, false, "-0.25", "-0.25", "bab", 2],
+    [0.75,  null, true,  "0.75",  "0.75",  "123", null],
+    [null,  null, true,  null,    null,    null,  3]
+])"});
+
+      ASSERT_OK_AND_ASSIGN(Datum aggregated_and_grouped,
+                           GroupByTest(
+                               {
+                                   table->GetColumnByName("floats"),
+                                   table->GetColumnByName("nulls"),
+                                   table->GetColumnByName("booleans"),
+                                   table->GetColumnByName("decimal128"),
+                                   table->GetColumnByName("decimal256"),
+                                   table->GetColumnByName("fixed_binary"),
+                               },
+                               {table->GetColumnByName("key")},
+                               {
+                                   {"hash_one", nullptr},
+                                   {"hash_one", nullptr},
+                                   {"hash_one", nullptr},
+                                   {"hash_one", nullptr},
+                                   {"hash_one", nullptr},
+                                   {"hash_one", nullptr},
+                               },
+                               use_threads, use_exec_plan));
+      ValidateOutput(aggregated_and_grouped);
+      SortBy({"key_0"}, &aggregated_and_grouped);
+
+      const auto& struct_arr = aggregated_and_grouped.array_as<StructArray>();
+      //  Check the key column
+      AssertDatumsEqual(ArrayFromJSON(int64(), R"([1, 2, 3, null])"),
+                        struct_arr->field(struct_arr->num_fields() - 1));
+
+      //  Check values individually
+      auto col_0_type = float64();
+      const auto& col_0 = struct_arr->field(0);
+      EXPECT_THAT(col_0->GetScalar(0),
+                  ResultWith(AnyOfJSON(col_0_type, R"([1.0, 3.25])")));
+      EXPECT_THAT(col_0->GetScalar(1),
+                  ResultWith(AnyOfJSON(col_0_type, R"([0.0, 0.125, -0.25])")));
+      EXPECT_THAT(col_0->GetScalar(2), ResultWith(AnyOfJSON(col_0_type, R"([null])")));
+      EXPECT_THAT(col_0->GetScalar(3),
+                  ResultWith(AnyOfJSON(col_0_type, R"([4.0, 0.75])")));
+
+      auto col_1_type = null();
+      const auto& col_1 = struct_arr->field(1);
+      EXPECT_THAT(col_1->GetScalar(0), ResultWith(AnyOfJSON(col_1_type, R"([null])")));
+      EXPECT_THAT(col_1->GetScalar(1), ResultWith(AnyOfJSON(col_1_type, R"([null])")));
+      EXPECT_THAT(col_1->GetScalar(2), ResultWith(AnyOfJSON(col_1_type, R"([null])")));
+      EXPECT_THAT(col_1->GetScalar(3), ResultWith(AnyOfJSON(col_1_type, R"([null])")));
+
+      auto col_2_type = boolean();
+      const auto& col_2 = struct_arr->field(2);
+      EXPECT_THAT(col_2->GetScalar(0), ResultWith(AnyOfJSON(col_2_type, R"([true])")));
+      EXPECT_THAT(col_2->GetScalar(1), ResultWith(AnyOfJSON(col_2_type, R"([false])")));
+      EXPECT_THAT(col_2->GetScalar(2),
+                  ResultWith(AnyOfJSON(col_2_type, R"([true, false])")));
+      EXPECT_THAT(col_2->GetScalar(3),
+                  ResultWith(AnyOfJSON(col_2_type, R"([true, null])")));
+
+      auto col_3_type = decimal128(3, 2);
+      const auto& col_3 = struct_arr->field(3);
+      EXPECT_THAT(col_3->GetScalar(0),
+                  ResultWith(AnyOfJSON(col_3_type, R"(["1.01", "3.25"])")));
+      EXPECT_THAT(col_3->GetScalar(1),
+                  ResultWith(AnyOfJSON(col_3_type, R"(["0.00", "0.12", "-0.25"])")));
+      EXPECT_THAT(col_3->GetScalar(2), ResultWith(AnyOfJSON(col_3_type, R"([null])")));
+      EXPECT_THAT(col_3->GetScalar(3),
+                  ResultWith(AnyOfJSON(col_3_type, R"(["4.01", "0.75"])")));
+
+      auto col_4_type = decimal256(3, 2);
+      const auto& col_4 = struct_arr->field(4);
+      EXPECT_THAT(col_4->GetScalar(0),
+                  ResultWith(AnyOfJSON(col_4_type, R"(["1.01", "3.25"])")));
+      EXPECT_THAT(col_4->GetScalar(1),
+                  ResultWith(AnyOfJSON(col_4_type, R"(["0.00", "0.12", "-0.25"])")));
+      EXPECT_THAT(col_4->GetScalar(2), ResultWith(AnyOfJSON(col_4_type, R"([null])")));
+      EXPECT_THAT(col_4->GetScalar(3),
+                  ResultWith(AnyOfJSON(col_4_type, R"(["4.01", "0.75"])")));
+
+      auto col_5_type = fixed_size_binary(3);
+      const auto& col_5 = struct_arr->field(5);
+      EXPECT_THAT(col_5->GetScalar(0),
+                  ResultWith(AnyOfJSON(col_5_type, R"(["aaa", "ddd"])")));
+      EXPECT_THAT(col_5->GetScalar(1),
+                  ResultWith(AnyOfJSON(col_5_type, R"(["bab", "bcd", "bac"])")));
+      EXPECT_THAT(col_5->GetScalar(2), ResultWith(AnyOfJSON(col_5_type, R"([null])")));
+      EXPECT_THAT(col_5->GetScalar(3),
+                  ResultWith(AnyOfJSON(col_5_type, R"(["123", "234"])")));
+    }
+  }
+}
+
+TEST(GroupBy, OneNumericTypes) {
+  std::vector<std::shared_ptr<DataType>> types;
+  types.insert(types.end(), NumericTypes().begin(), NumericTypes().end());
+  types.insert(types.end(), TemporalTypes().begin(), TemporalTypes().end());
+  types.push_back(month_interval());
+
+  const std::vector<std::string> numeric_table_json = {R"([
+      [null, 1],
+      [1,    1]
+    ])",
+                                                       R"([
+      [0,    2],
+      [null, 3],
+      [3,    4],
+      [5,    4],
+      [4,    null],
+      [3,    1],
+      [0,    2]
+    ])",
+                                                       R"([
+      [0,    2],
+      [1,    null],
+      [null, 3]
+    ])"};
+
+  const std::vector<std::string> temporal_table_json = {R"([
+      [null,      1],
+      [86400000,  1]
+    ])",
+                                                        R"([
+      [0,         2],
+      [null,      3],
+      [259200000, 4],
+      [432000000, 4],
+      [345600000, null],
+      [259200000, 1],
+      [0,         2]
+    ])",
+                                                        R"([
+      [0,         2],
+      [86400000,  null],
+      [null,      3]
+    ])"};
+
+  for (const auto& type : types) {
+    for (bool use_exec_plan : {true, false}) {
+      for (bool use_threads : {true, false}) {
+        SCOPED_TRACE(type->ToString());
+        auto in_schema = schema({field("argument0", type), field("key", int64())});
+        auto table =
+            TableFromJSON(in_schema, (type->name() == "date64") ? temporal_table_json
+                                                                : numeric_table_json);
+        ASSERT_OK_AND_ASSIGN(
+            Datum aggregated_and_grouped,
+            GroupByTest({table->GetColumnByName("argument0")},
+                        {table->GetColumnByName("key")}, {{"hash_one", nullptr}},
+                        use_threads, use_exec_plan));
+        ValidateOutput(aggregated_and_grouped);
+        SortBy({"key_0"}, &aggregated_and_grouped);
+
+        const auto& struct_arr = aggregated_and_grouped.array_as<StructArray>();
+        //  Check the key column
+        AssertDatumsEqual(ArrayFromJSON(int64(), R"([1, 2, 3, 4, null])"),
+                          struct_arr->field(struct_arr->num_fields() - 1));
+
+        //  Check values individually
+        const auto& col = struct_arr->field(0);
+        if (type->name() == "date64") {
+          EXPECT_THAT(col->GetScalar(0),
+                      ResultWith(AnyOfJSON(type, R"([86400000, 259200000])")));
+          EXPECT_THAT(col->GetScalar(1), ResultWith(AnyOfJSON(type, R"([0])")));
+          EXPECT_THAT(col->GetScalar(2), ResultWith(AnyOfJSON(type, R"([null])")));
+          EXPECT_THAT(col->GetScalar(3),
+                      ResultWith(AnyOfJSON(type, R"([259200000, 432000000])")));
+          EXPECT_THAT(col->GetScalar(4),
+                      ResultWith(AnyOfJSON(type, R"([345600000, 86400000])")));
+        } else {
+          EXPECT_THAT(col->GetScalar(0), ResultWith(AnyOfJSON(type, R"([1, 3])")));
+          EXPECT_THAT(col->GetScalar(1), ResultWith(AnyOfJSON(type, R"([0])")));
+          EXPECT_THAT(col->GetScalar(2), ResultWith(AnyOfJSON(type, R"([null])")));
+          EXPECT_THAT(col->GetScalar(3), ResultWith(AnyOfJSON(type, R"([3, 5])")));
+          EXPECT_THAT(col->GetScalar(4), ResultWith(AnyOfJSON(type, R"([4, 1])")));
+        }
+      }
+    }
+  }
+}
+
+TEST(GroupBy, OneBinaryTypes) {
+  for (bool use_exec_plan : {true, false}) {
+    for (bool use_threads : {true, false}) {
+      for (const auto& type : BaseBinaryTypes()) {
+        SCOPED_TRACE(use_threads ? "parallel/merged" : "serial");
+
+        auto table = TableFromJSON(schema({
+                                       field("argument0", type),
+                                       field("key", int64()),
+                                   }),
+                                   {R"([
+    [null,   1],
+    ["aaaa", 1]
+])",
+                                    R"([
+    ["babcd",2],
+    [null,   3],
+    ["2",    null],
+    ["d",    1],
+    ["bc",   2]
+])",
+                                    R"([
+    ["bcd", 2],
+    ["123", null],
+    [null,  3]
+])"});
+
+        ASSERT_OK_AND_ASSIGN(
+            Datum aggregated_and_grouped,
+            GroupByTest({table->GetColumnByName("argument0")},
+                        {table->GetColumnByName("key")}, {{"hash_one", nullptr}},
+                        use_threads, use_exec_plan));
+        ValidateOutput(aggregated_and_grouped);
+        SortBy({"key_0"}, &aggregated_and_grouped);
+
+        const auto& struct_arr = aggregated_and_grouped.array_as<StructArray>();
+        //  Check the key column
+        AssertDatumsEqual(ArrayFromJSON(int64(), R"([1, 2, 3, null])"),
+                          struct_arr->field(struct_arr->num_fields() - 1));
+
+        const auto& col = struct_arr->field(0);
+        EXPECT_THAT(col->GetScalar(0), ResultWith(AnyOfJSON(type, R"(["aaaa", "d"])")));
+        EXPECT_THAT(col->GetScalar(1),
+                    ResultWith(AnyOfJSON(type, R"(["bcd", "bc", "babcd"])")));
+        EXPECT_THAT(col->GetScalar(2), ResultWith(AnyOfJSON(type, R"([null])")));
+        EXPECT_THAT(col->GetScalar(3), ResultWith(AnyOfJSON(type, R"(["2", "123"])")));
+      }
+    }
+  }
+}
+
+TEST(GroupBy, OneScalar) {

Review comment:
   ExecBatch, for each column, can hold either an array (as we've seen so far) or a scalar. A scalar is used to compress the representation when a column holds the same value for every row in the batch. So first off, the first value in each row above needs to be the exact same value:
   
   ```
   input.batches = {
         ExecBatchFromJSON({ValueDescr::Scalar(int32()), int64()},
                           R"([[-1, 1], [-1, 1], [-1, 1], [-1, 1]])"),
   ```
   
   However, just because one batch has a scalar doesn't mean that all batches do; that's why the third batch above is different.
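
   A fuller sketch of such an input might look like this (values here are illustrative, not the PR's actual test data):

   ```
   // Two batches whose first column is a Scalar (the repeated JSON value
   // collapses into one scalar standing in for every row), followed by a
   // batch whose first column is an ordinary Array.
   input.batches = {
       ExecBatchFromJSON({ValueDescr::Scalar(int32()), int64()},
                         R"([[-1, 1], [-1, 1], [-1, 1], [-1, 1]])"),
       ExecBatchFromJSON({ValueDescr::Scalar(int32()), int64()},
                         R"([[-1, 2], [-1, 2], [-1, 2], [-1, 2]])"),
       // An Array column, so the values may differ from row to row.
       ExecBatchFromJSON({int32(), int64()},
                         R"([[5, 3], [6, 3], [7, 3]])"),
   };
   ```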




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] lidavidm commented on a change in pull request #12368: ARROW-13993: [C++] [Compute] Add hash_one aggregate function

Posted by GitBox <gi...@apache.org>.
lidavidm commented on a change in pull request #12368:
URL: https://github.com/apache/arrow/pull/12368#discussion_r808483788



##########
File path: cpp/src/arrow/compute/kernels/hash_aggregate_test.cc
##########
@@ -2460,6 +2461,294 @@ TEST(GroupBy, Distinct) {
   }
 }
 
+TEST(GroupBy, OneMiscTypes) {
+  auto in_schema = schema({
+      field("floats", float64()),
+      field("nulls", null()),
+      field("booleans", boolean()),
+      field("decimal128", decimal128(3, 2)),
+      field("decimal256", decimal256(3, 2)),
+      field("fixed_binary", fixed_size_binary(3)),
+      field("key", int64()),
+  });
+  for (bool use_exec_plan : {true, false}) {
+    for (bool use_threads : {true, false}) {
+      SCOPED_TRACE(use_threads ? "parallel/merged" : "serial");
+
+      auto table = TableFromJSON(in_schema, {R"([
+    [null, null, true,   null,    null,    null,  1],
+    [1.0,  null, true,   "1.01",  "1.01",  "aaa", 1]
+])",
+                                             R"([
+    [0.0,   null, false, "0.00",  "0.00",  "bac", 2],
+    [null,  null, false, null,    null,    null,  3],
+    [4.0,   null, null,  "4.01",  "4.01",  "234", null],
+    [3.25,  null, true,  "3.25",  "3.25",  "ddd", 1],
+    [0.125, null, false, "0.12",  "0.12",  "bcd", 2]
+])",
+                                             R"([
+    [-0.25, null, false, "-0.25", "-0.25", "bab", 2],
+    [0.75,  null, true,  "0.75",  "0.75",  "123", null],
+    [null,  null, true,  null,    null,    null,  3]
+])"});
+
+      ASSERT_OK_AND_ASSIGN(Datum aggregated_and_grouped,
+                           GroupByTest(
+                               {
+                                   table->GetColumnByName("floats"),
+                                   table->GetColumnByName("nulls"),
+                                   table->GetColumnByName("booleans"),
+                                   table->GetColumnByName("decimal128"),
+                                   table->GetColumnByName("decimal256"),
+                                   table->GetColumnByName("fixed_binary"),
+                               },
+                               {table->GetColumnByName("key")},
+                               {
+                                   {"hash_one", nullptr},
+                                   {"hash_one", nullptr},
+                                   {"hash_one", nullptr},
+                                   {"hash_one", nullptr},
+                                   {"hash_one", nullptr},
+                                   {"hash_one", nullptr},
+                               },
+                               use_threads, use_exec_plan));
+      ValidateOutput(aggregated_and_grouped);
+      SortBy({"key_0"}, &aggregated_and_grouped);
+
+      const auto& struct_arr = aggregated_and_grouped.array_as<StructArray>();
+      //  Check the key column
+      AssertDatumsEqual(ArrayFromJSON(int64(), R"([1, 2, 3, null])"),
+                        struct_arr->field(struct_arr->num_fields() - 1));
+
+      //  Check values individually
+      auto col_0_type = float64();
+      const auto& col_0 = struct_arr->field(0);
+      EXPECT_THAT(col_0->GetScalar(0),
+                  ResultWith(AnyOfJSON(col_0_type, R"([1.0, 3.25])")));
+      EXPECT_THAT(col_0->GetScalar(1),
+                  ResultWith(AnyOfJSON(col_0_type, R"([0.0, 0.125, -0.25])")));
+      EXPECT_THAT(col_0->GetScalar(2), ResultWith(AnyOfJSON(col_0_type, R"([null])")));
+      EXPECT_THAT(col_0->GetScalar(3),
+                  ResultWith(AnyOfJSON(col_0_type, R"([4.0, 0.75])")));
+
+      auto col_1_type = null();
+      const auto& col_1 = struct_arr->field(1);
+      EXPECT_THAT(col_1->GetScalar(0), ResultWith(AnyOfJSON(col_1_type, R"([null])")));
+      EXPECT_THAT(col_1->GetScalar(1), ResultWith(AnyOfJSON(col_1_type, R"([null])")));
+      EXPECT_THAT(col_1->GetScalar(2), ResultWith(AnyOfJSON(col_1_type, R"([null])")));
+      EXPECT_THAT(col_1->GetScalar(3), ResultWith(AnyOfJSON(col_1_type, R"([null])")));
+
+      auto col_2_type = boolean();
+      const auto& col_2 = struct_arr->field(2);
+      EXPECT_THAT(col_2->GetScalar(0), ResultWith(AnyOfJSON(col_2_type, R"([true])")));
+      EXPECT_THAT(col_2->GetScalar(1), ResultWith(AnyOfJSON(col_2_type, R"([false])")));
+      EXPECT_THAT(col_2->GetScalar(2),
+                  ResultWith(AnyOfJSON(col_2_type, R"([true, false])")));
+      EXPECT_THAT(col_2->GetScalar(3),
+                  ResultWith(AnyOfJSON(col_2_type, R"([true, null])")));
+
+      auto col_3_type = decimal128(3, 2);
+      const auto& col_3 = struct_arr->field(3);
+      EXPECT_THAT(col_3->GetScalar(0),
+                  ResultWith(AnyOfJSON(col_3_type, R"(["1.01", "3.25"])")));
+      EXPECT_THAT(col_3->GetScalar(1),
+                  ResultWith(AnyOfJSON(col_3_type, R"(["0.00", "0.12", "-0.25"])")));
+      EXPECT_THAT(col_3->GetScalar(2), ResultWith(AnyOfJSON(col_3_type, R"([null])")));
+      EXPECT_THAT(col_3->GetScalar(3),
+                  ResultWith(AnyOfJSON(col_3_type, R"(["4.01", "0.75"])")));
+
+      auto col_4_type = decimal256(3, 2);
+      const auto& col_4 = struct_arr->field(4);
+      EXPECT_THAT(col_4->GetScalar(0),
+                  ResultWith(AnyOfJSON(col_4_type, R"(["1.01", "3.25"])")));
+      EXPECT_THAT(col_4->GetScalar(1),
+                  ResultWith(AnyOfJSON(col_4_type, R"(["0.00", "0.12", "-0.25"])")));
+      EXPECT_THAT(col_4->GetScalar(2), ResultWith(AnyOfJSON(col_4_type, R"([null])")));
+      EXPECT_THAT(col_4->GetScalar(3),
+                  ResultWith(AnyOfJSON(col_4_type, R"(["4.01", "0.75"])")));
+
+      auto col_5_type = fixed_size_binary(3);
+      const auto& col_5 = struct_arr->field(5);
+      EXPECT_THAT(col_5->GetScalar(0),
+                  ResultWith(AnyOfJSON(col_5_type, R"(["aaa", "ddd"])")));
+      EXPECT_THAT(col_5->GetScalar(1),
+                  ResultWith(AnyOfJSON(col_5_type, R"(["bab", "bcd", "bac"])")));
+      EXPECT_THAT(col_5->GetScalar(2), ResultWith(AnyOfJSON(col_5_type, R"([null])")));
+      EXPECT_THAT(col_5->GetScalar(3),
+                  ResultWith(AnyOfJSON(col_5_type, R"(["123", "234"])")));
+    }
+  }
+}
+
+TEST(GroupBy, OneNumericTypes) {
+  std::vector<std::shared_ptr<DataType>> types;
+  types.insert(types.end(), NumericTypes().begin(), NumericTypes().end());
+  types.insert(types.end(), TemporalTypes().begin(), TemporalTypes().end());
+  types.push_back(month_interval());
+
+  const std::vector<std::string> numeric_table_json = {R"([
+      [null, 1],
+      [1,    1]
+    ])",
+                                                       R"([
+      [0,    2],
+      [null, 3],
+      [3,    4],
+      [5,    4],
+      [4,    null],
+      [3,    1],
+      [0,    2]
+    ])",
+                                                       R"([
+      [0,    2],
+      [1,    null],
+      [null, 3]
+    ])"};
+
+  const std::vector<std::string> temporal_table_json = {R"([
+      [null,      1],
+      [86400000,  1]
+    ])",
+                                                        R"([
+      [0,         2],
+      [null,      3],
+      [259200000, 4],
+      [432000000, 4],
+      [345600000, null],
+      [259200000, 1],
+      [0,         2]
+    ])",
+                                                        R"([
+      [0,         2],
+      [86400000,  null],
+      [null,      3]
+    ])"};
+
+  for (const auto& type : types) {
+    for (bool use_exec_plan : {true, false}) {
+      for (bool use_threads : {true, false}) {
+        SCOPED_TRACE(type->ToString());
+        auto in_schema = schema({field("argument0", type), field("key", int64())});
+        auto table =
+            TableFromJSON(in_schema, (type->name() == "date64") ? temporal_table_json
+                                                                : numeric_table_json);
+        ASSERT_OK_AND_ASSIGN(
+            Datum aggregated_and_grouped,
+            GroupByTest({table->GetColumnByName("argument0")},
+                        {table->GetColumnByName("key")}, {{"hash_one", nullptr}},
+                        use_threads, use_exec_plan));
+        ValidateOutput(aggregated_and_grouped);
+        SortBy({"key_0"}, &aggregated_and_grouped);
+
+        const auto& struct_arr = aggregated_and_grouped.array_as<StructArray>();
+        //  Check the key column
+        AssertDatumsEqual(ArrayFromJSON(int64(), R"([1, 2, 3, 4, null])"),
+                          struct_arr->field(struct_arr->num_fields() - 1));
+
+        //  Check values individually
+        const auto& col = struct_arr->field(0);
+        if (type->name() == "date64") {
+          EXPECT_THAT(col->GetScalar(0),
+                      ResultWith(AnyOfJSON(type, R"([86400000, 259200000])")));
+          EXPECT_THAT(col->GetScalar(1), ResultWith(AnyOfJSON(type, R"([0])")));
+          EXPECT_THAT(col->GetScalar(2), ResultWith(AnyOfJSON(type, R"([null])")));
+          EXPECT_THAT(col->GetScalar(3),
+                      ResultWith(AnyOfJSON(type, R"([259200000, 432000000])")));
+          EXPECT_THAT(col->GetScalar(4),
+                      ResultWith(AnyOfJSON(type, R"([345600000, 86400000])")));
+        } else {
+          EXPECT_THAT(col->GetScalar(0), ResultWith(AnyOfJSON(type, R"([1, 3])")));
+          EXPECT_THAT(col->GetScalar(1), ResultWith(AnyOfJSON(type, R"([0])")));
+          EXPECT_THAT(col->GetScalar(2), ResultWith(AnyOfJSON(type, R"([null])")));
+          EXPECT_THAT(col->GetScalar(3), ResultWith(AnyOfJSON(type, R"([3, 5])")));
+          EXPECT_THAT(col->GetScalar(4), ResultWith(AnyOfJSON(type, R"([4, 1])")));
+        }
+      }
+    }
+  }
+}
+
+TEST(GroupBy, OneBinaryTypes) {
+  for (bool use_exec_plan : {true, false}) {
+    for (bool use_threads : {true, false}) {
+      for (const auto& type : BaseBinaryTypes()) {
+        SCOPED_TRACE(use_threads ? "parallel/merged" : "serial");
+
+        auto table = TableFromJSON(schema({
+                                       field("argument0", type),
+                                       field("key", int64()),
+                                   }),
+                                   {R"([
+    [null,   1],
+    ["aaaa", 1]
+])",
+                                    R"([
+    ["babcd",2],
+    [null,   3],
+    ["2",    null],
+    ["d",    1],
+    ["bc",   2]
+])",
+                                    R"([
+    ["bcd", 2],
+    ["123", null],
+    [null,  3]
+])"});
+
+        ASSERT_OK_AND_ASSIGN(
+            Datum aggregated_and_grouped,
+            GroupByTest({table->GetColumnByName("argument0")},
+                        {table->GetColumnByName("key")}, {{"hash_one", nullptr}},
+                        use_threads, use_exec_plan));
+        ValidateOutput(aggregated_and_grouped);
+        SortBy({"key_0"}, &aggregated_and_grouped);
+
+        const auto& struct_arr = aggregated_and_grouped.array_as<StructArray>();
+        //  Check the key column
+        AssertDatumsEqual(ArrayFromJSON(int64(), R"([1, 2, 3, null])"),
+                          struct_arr->field(struct_arr->num_fields() - 1));
+
+        const auto& col = struct_arr->field(0);
+        EXPECT_THAT(col->GetScalar(0), ResultWith(AnyOfJSON(type, R"(["aaaa", "d"])")));
+        EXPECT_THAT(col->GetScalar(1),
+                    ResultWith(AnyOfJSON(type, R"(["bcd", "bc", "babcd"])")));
+        EXPECT_THAT(col->GetScalar(2), ResultWith(AnyOfJSON(type, R"([null])")));
+        EXPECT_THAT(col->GetScalar(3), ResultWith(AnyOfJSON(type, R"(["2", "123"])")));
+      }
+    }
+  }
+}
+
+TEST(GroupBy, OneScalar) {

Review comment:
       Hmm, this needs to be something like `ValueDescr::Scalar(type)`. See for instance CountScalar: https://github.com/apache/arrow/blob/7236f48d7c534802ebd84daa709aeaba070d6780/cpp/src/arrow/compute/kernels/hash_aggregate_test.cc#L846-L884
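       For reference, a minimal sketch of the suggested shape, modeled on the
       linked CountScalar test. The key piece is declaring the argument column
       with `ValueDescr::Scalar` in `ExecBatchFromJSON`; the batch contents and
       the exact `GroupByUsingExecPlan` arguments below are illustrative
       assumptions, not the final code:

       ```cpp
       // Sketch: exercise the kernel's Scalar input path by declaring the
       // argument column as a Scalar descriptor rather than a plain array.
       BatchesWithSchema input;
       input.batches = {
           // Here the argument is a Scalar broadcast across the batch's rows...
           ExecBatchFromJSON({ValueDescr::Scalar(int32()), int64()},
                             R"([[-1, 1], [-1, 1], [-1, 2]])"),
           // ...and here it is an ordinary Array column, for contrast.
           ExecBatchFromJSON({int32(), int64()}, R"([[22, 1], [3, 2], [4, 3]])"),
       };
       input.schema = schema({field("argument", int32()), field("key", int64())});

       for (bool use_threads : {true, false}) {
         SCOPED_TRACE(use_threads ? "parallel/merged" : "serial");
         ASSERT_OK_AND_ASSIGN(Datum actual,
                              GroupByUsingExecPlan(input, use_threads, {"key"},
                                                   {{"hash_one", nullptr}}));
         ValidateOutput(actual);
         // hash_one may surface any value seen in a group, so the assertions
         // would use AnyOfJSON rather than exact equality.
       }
       ```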

##########
File path: cpp/src/arrow/testing/matchers.h
##########
@@ -61,6 +61,65 @@ class PointeesEqualMatcher {
 // Useful in conjunction with other googletest matchers.
 inline PointeesEqualMatcher PointeesEqual() { return {}; }
 
+class AnyOfJSONMatcher {

Review comment:
       Thanks! I think it would compose a little better if it took an array instead of a type + JSON string, but that's not a big deal (we can see if it's useful elsewhere first).
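       As a rough illustration, the array-taking form could sit under the
       current one, with the JSON entry point delegating to it. This is a
       sketch only — `AnyOfArray` and the matcher class below are hypothetical
       names, not the API in this PR:

       ```cpp
       #include <gmock/gmock.h>

       #include "arrow/array.h"
       #include "arrow/scalar.h"
       #include "arrow/testing/gtest_util.h"  // ArrayFromJSON

       // Hypothetical variant: match a Scalar against a pre-built Array of
       // candidate values instead of parsing them from (type, JSON string).
       class AnyOfArrayMatcher
           : public testing::MatcherInterface<const std::shared_ptr<Scalar>&> {
        public:
         explicit AnyOfArrayMatcher(std::shared_ptr<Array> candidates)
             : candidates_(std::move(candidates)) {}

         bool MatchAndExplain(const std::shared_ptr<Scalar>& scalar,
                              testing::MatchResultListener* listener) const override {
           for (int64_t i = 0; i < candidates_->length(); ++i) {
             auto candidate = candidates_->GetScalar(i).ValueOrDie();
             if (candidate->Equals(*scalar)) return true;
           }
           *listener << scalar->ToString() << " matched no candidate";
           return false;
         }

         void DescribeTo(std::ostream* os) const override {
           *os << "is any element of " << candidates_->ToString();
         }

        private:
         std::shared_ptr<Array> candidates_;
       };

       inline testing::Matcher<const std::shared_ptr<Scalar>&> AnyOfArray(
           std::shared_ptr<Array> candidates) {
         return testing::MakeMatcher(new AnyOfArrayMatcher(std::move(candidates)));
       }

       // The existing (type, JSON) form could then be a thin wrapper:
       inline testing::Matcher<const std::shared_ptr<Scalar>&> AnyOfJSON(
           std::shared_ptr<DataType> type, const std::string& json) {
         return AnyOfArray(ArrayFromJSON(type, json));
       }
       ```

       With that, a test already holding an expected column could write
       `EXPECT_THAT(col->GetScalar(0), ResultWith(AnyOfArray(expected_col)))`
       without round-tripping through JSON.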






[GitHub] [arrow] ursabot edited a comment on pull request #12368: ARROW-13993: [C++] [Compute] Add hash_one aggregate function

Posted by GitBox <gi...@apache.org>.
ursabot edited a comment on pull request #12368:
URL: https://github.com/apache/arrow/pull/12368#issuecomment-1042943111


   Benchmark runs are scheduled for baseline = ed25c616c6270142cce0a2a36c7474e28e167184 and contender = 74f512260fa69903feac61e1287f6954a3d98204. 74f512260fa69903feac61e1287f6954a3d98204 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
   Conbench compare runs links:
   [Finished :arrow_down:0.0% :arrow_up:0.0%] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/a94a3f0000b84a85a4d19562ff559187...7b33e4d8929944749d78fd64bf5054fa/)
   [Finished :arrow_down:1.48% :arrow_up:0.13%] [test-mac-arm](https://conbench.ursa.dev/compare/runs/15b82b3edbb4496baefc9bb7f57453c9...34fcb6b914b4436fae5816b099686be4/)
   [Failed :arrow_down:9.39% :arrow_up:0.0%] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/11c0b23a67e6433c9358d07ced046614...2f00fea949d543acae398e06ffb8787f/)
   [Finished :arrow_down:0.21% :arrow_up:0.09%] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/31368249ed3448b5b3ec5a0396533d0b...ce98c7d67ad84386b4028c02679b0d02/)
   Supported benchmarks:
   ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python. Runs only benchmarks with cloud = True
   test-mac-arm: Supported benchmark langs: C++, Python, R
   ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
   ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
   

