You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2020/09/29 05:09:59 UTC

[GitHub] [arrow] arw2019 opened a new pull request #8294: ARROW-1846: [C++][Compute] Implement "any" reduction kernel for boolean data

arw2019 opened a new pull request #8294:
URL: https://github.com/apache/arrow/pull/8294


   As discussed on Jira this is a short-circuiting Max for booleans implemented on top of the existing `min_max` kernel.
   
   As is there are no options: null is always taken to evaluate to false. 
   
   If we want to include control over null handling I can either add `Options` to the kernel or I can implement an `any_kleene` kernel by analogy with the `and_kleene` and `or_kleene` logical kernels that we have. 
   
   In Python the two options would look like:
   
   ``` python
   In []: a = pa.array([True, None], type='bool') 
       ...:  
       ...: # option 1 
       ...: pc.any(a).as_py() is True 
       ...: pc.any_kleene(a).as_py() is None 
       ...:  
       ...: # option 2 
       ...: pc.any(null_handling='skip') is True 
       ...: pc.any(null_handling='emit_null') is None                                                     
   ```


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou closed pull request #8294: ARROW-1846: [C++][Compute] Implement "any" reduction kernel for boolean data

Posted by GitBox <gi...@apache.org>.
pitrou closed pull request #8294:
URL: https://github.com/apache/arrow/pull/8294


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] arw2019 commented on pull request #8294: ARROW-1846: [C++][Compute] Implement "any" reduction kernel for boolean data

Posted by GitBox <gi...@apache.org>.
arw2019 commented on pull request #8294:
URL: https://github.com/apache/arrow/pull/8294#issuecomment-725852668


   Turning to draft while I work on the rebase


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] arw2019 commented on pull request #8294: ARROW-1846: [C++][Compute] Implement "any" reduction kernel for boolean data

Posted by GitBox <gi...@apache.org>.
arw2019 commented on pull request #8294:
URL: https://github.com/apache/arrow/pull/8294#issuecomment-733260553


   thanks @pitrou @jorisvandenbossche @wesm for reviewing!!!


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] arw2019 commented on a change in pull request #8294: ARROW-1846: [C++][Compute] Implement "any" reduction kernel for boolean data

Posted by GitBox <gi...@apache.org>.
arw2019 commented on a change in pull request #8294:
URL: https://github.com/apache/arrow/pull/8294#discussion_r497262235



##########
File path: cpp/src/arrow/compute/kernels/aggregate_basic_internal.h
##########
@@ -547,6 +547,36 @@ struct BooleanMinMaxImpl : public MinMaxImpl<BooleanType, SimdLevel> {
   }
 };
 
+template <SimdLevel::type SimdLevel>
+struct BooleanAnyImpl : public MinMaxImpl<BooleanType, SimdLevel> {
+  using StateType = MinMaxState<BooleanType, SimdLevel>;
+  using ArrayType = typename TypeTraits<BooleanType>::ArrayType;
+  using MinMaxImpl<BooleanType, SimdLevel>::MinMaxImpl;
+
+  void Consume(KernelContext*, const ExecBatch& batch) override {
+    // short-circuit if seen a True already
+    if (this->state.max == true) {
+      return;
+    }
+
+    ArrayType arr(batch[0].array());
+    const auto true_count = arr.true_count();
+    if (true_count > 0) {
+      this->state.max = true;
+    }
+  }
+
+  void Finalize(KernelContext*, Datum* out) override {
+    using ScalarType = typename TypeTraits<BooleanType>::ScalarType;

Review comment:
       Switched to that here (and in other places)

##########
File path: cpp/src/arrow/compute/kernels/aggregate_basic_internal.h
##########
@@ -547,6 +547,36 @@ struct BooleanMinMaxImpl : public MinMaxImpl<BooleanType, SimdLevel> {
   }
 };
 
+template <SimdLevel::type SimdLevel>
+struct BooleanAnyImpl : public MinMaxImpl<BooleanType, SimdLevel> {
+  using StateType = MinMaxState<BooleanType, SimdLevel>;
+  using ArrayType = typename TypeTraits<BooleanType>::ArrayType;
+  using MinMaxImpl<BooleanType, SimdLevel>::MinMaxImpl;
+
+  void Consume(KernelContext*, const ExecBatch& batch) override {
+    // short-circuit if seen a True already
+    if (this->state.max == true) {
+      return;
+    }
+
+    ArrayType arr(batch[0].array());

Review comment:
       Switched to that here (and in other places)

##########
File path: cpp/src/arrow/compute/kernels/aggregate_test.cc
##########
@@ -726,6 +726,109 @@ TYPED_TEST(TestRandomNumericMinMaxKernel, RandomArrayMinMax) {
   }
 }
 
+//
+// Any
+//
+
+class TestPrimitiveAnyKernel : public ::testing::Test {
+  using Traits = TypeTraits<BooleanType>;
+  using ArrayType = typename Traits::ArrayType;
+  using c_type = typename BooleanType::c_type;
+  using ScalarType = typename Traits::ScalarType;

Review comment:
       Got rid of all these

##########
File path: cpp/src/arrow/compute/kernels/aggregate_basic_internal.h
##########
@@ -547,6 +547,36 @@ struct BooleanMinMaxImpl : public MinMaxImpl<BooleanType, SimdLevel> {
   }
 };
 
+template <SimdLevel::type SimdLevel>
+struct BooleanAnyImpl : public MinMaxImpl<BooleanType, SimdLevel> {
+  using StateType = MinMaxState<BooleanType, SimdLevel>;
+  using ArrayType = typename TypeTraits<BooleanType>::ArrayType;
+  using MinMaxImpl<BooleanType, SimdLevel>::MinMaxImpl;
+
+  void Consume(KernelContext*, const ExecBatch& batch) override {
+    // short-circuit if seen a True already
+    if (this->state.max == true) {
+      return;
+    }
+
+    ArrayType arr(batch[0].array());
+    const auto true_count = arr.true_count();
+    if (true_count > 0) {
+      this->state.max = true;
+    }
+  }
+
+  void Finalize(KernelContext*, Datum* out) override {
+    using ScalarType = typename TypeTraits<BooleanType>::ScalarType;
+
+    if (this->state.max == true) {
+      out->value = std::make_shared<ScalarType>(true);
+    } else {
+      out->value = std::make_shared<ScalarType>(false);
+    }

Review comment:
       Done!

##########
File path: cpp/src/arrow/compute/kernels/aggregate_basic_internal.h
##########
@@ -547,6 +547,36 @@ struct BooleanMinMaxImpl : public MinMaxImpl<BooleanType, SimdLevel> {
   }
 };
 
+template <SimdLevel::type SimdLevel>
+struct BooleanAnyImpl : public MinMaxImpl<BooleanType, SimdLevel> {
+  using StateType = MinMaxState<BooleanType, SimdLevel>;

Review comment:
       Done

##########
File path: cpp/src/arrow/compute/kernels/aggregate_test.cc
##########
@@ -726,6 +726,109 @@ TYPED_TEST(TestRandomNumericMinMaxKernel, RandomArrayMinMax) {
   }
 }
 
+//
+// Any
+//
+
+class TestPrimitiveAnyKernel : public ::testing::Test {
+  using Traits = TypeTraits<BooleanType>;
+  using ArrayType = typename Traits::ArrayType;
+  using c_type = typename BooleanType::c_type;
+  using ScalarType = typename Traits::ScalarType;
+
+ public:
+  void AssertAnyIs(const Datum& array, c_type expected) {
+    ASSERT_OK_AND_ASSIGN(Datum out, Any(array));
+    const ScalarType& out_any = out.scalar_as<ScalarType>();
+    const auto expected_any = static_cast<const ScalarType>(expected);
+    ASSERT_EQ(out_any, expected_any);
+  }
+
+  void AssertAnyIs(const std::string& json, c_type expected) {
+    auto array = ArrayFromJSON(type_singleton(), json);
+    AssertAnyIs(array, expected);
+  }
+
+  void AssertAnyIs(const std::vector<std::string>& json, c_type expected) {
+    auto array = ChunkedArrayFromJSON(type_singleton(), json);
+    AssertAnyIs(array, expected);
+  }
+
+  std::shared_ptr<DataType> type_singleton() { return Traits::type_singleton(); }
+};
+
+class TestAnyKernel : public TestPrimitiveAnyKernel {};
+
+TEST_F(TestAnyKernel, Basics) {
+  std::vector<std::string> chunked_input0 = {"[]", "[true]"};
+  std::vector<std::string> chunked_input1 = {"[true, true, null]", "[true, null]"};
+  std::vector<std::string> chunked_input2 = {"[false, false, false]", "[false]"};
+  std::vector<std::string> chunked_input3 = {"[false, null]", "[null, false]"};
+
+  this->AssertAnyIs("[false]", false);
+  this->AssertAnyIs("[true, false]", true);
+  this->AssertAnyIs("[null, null, null]", false);
+  this->AssertAnyIs("[false, false, false]", false);
+  this->AssertAnyIs("[false, false, false, null]", false);
+  this->AssertAnyIs("[true, null, true, true]", true);
+  this->AssertAnyIs("[false, null, false, true]", true);
+  this->AssertAnyIs("[true, null, false, true]", true);
+  this->AssertAnyIs(chunked_input0, true);
+  this->AssertAnyIs(chunked_input1, true);
+  this->AssertAnyIs(chunked_input2, false);
+  this->AssertAnyIs(chunked_input3, false);
+}
+
+struct AnyResult {
+  using c_type = typename BooleanType::c_type;
+
+  c_type any = false;
+};
+
+static AnyResult NaiveAny(const Array& array) {
+  using ArrayType = typename TypeTraits<BooleanType>::ArrayType;
+
+  AnyResult result;
+
+  const auto& array_numeric = reinterpret_cast<const ArrayType&>(array);
+  const auto true_count = array_numeric.true_count();
+
+  if (true_count > 0) {
+    result.any = true;
+  }
+  return result;
+}
+
+void ValidateAny(const Array& array) {
+  using Traits = TypeTraits<BooleanType>;
+  using ScalarType = typename Traits::ScalarType;
+
+  ASSERT_OK_AND_ASSIGN(Datum out, Any(array));
+  const ScalarType& out_any = out.scalar_as<ScalarType>();
+
+  auto expected = NaiveAny(array);
+  const auto& expected_any = static_cast<const ScalarType>(expected.any);
+
+  ASSERT_EQ(out_any, expected_any);
+}
+
+class TestRandomBooleanAnyKernel : public ::testing::Test {};
+
+TEST_F(TestRandomBooleanAnyKernel, RandomArrayAny) {
+  auto rand = random::RandomArrayGenerator(0x8afc055);
+  // Test size up to 1<<11 (2048).
+  for (size_t i = 3; i < 12; i += 2) {
+    for (auto null_probability : {0.0, 0.01, 0.1, 0.5, 0.99, 1.0}) {
+      int64_t base_length = (1UL << i) + 2;
+      auto array = rand.Boolean(base_length, null_probability, null_probability);
+      for (auto length_adjust : {-2, -1, 0, 1, 2}) {
+        int64_t length = (1UL << i) + length_adjust;
+        ValidateAny(*array->Slice(0, length));
+      }
+    }
+  }
+}

Review comment:
       Right. Should I drop this part of the test altogether?

##########
File path: cpp/src/arrow/compute/kernels/aggregate_test.cc
##########
@@ -726,6 +726,109 @@ TYPED_TEST(TestRandomNumericMinMaxKernel, RandomArrayMinMax) {
   }
 }
 
+//
+// Any
+//
+
+class TestPrimitiveAnyKernel : public ::testing::Test {
+  using Traits = TypeTraits<BooleanType>;
+  using ArrayType = typename Traits::ArrayType;
+  using c_type = typename BooleanType::c_type;
+  using ScalarType = typename Traits::ScalarType;
+
+ public:
+  void AssertAnyIs(const Datum& array, c_type expected) {
+    ASSERT_OK_AND_ASSIGN(Datum out, Any(array));
+    const ScalarType& out_any = out.scalar_as<ScalarType>();
+    const auto expected_any = static_cast<const ScalarType>(expected);
+    ASSERT_EQ(out_any, expected_any);
+  }
+
+  void AssertAnyIs(const std::string& json, c_type expected) {
+    auto array = ArrayFromJSON(type_singleton(), json);
+    AssertAnyIs(array, expected);
+  }
+
+  void AssertAnyIs(const std::vector<std::string>& json, c_type expected) {
+    auto array = ChunkedArrayFromJSON(type_singleton(), json);
+    AssertAnyIs(array, expected);
+  }
+
+  std::shared_ptr<DataType> type_singleton() { return Traits::type_singleton(); }
+};
+
+class TestAnyKernel : public TestPrimitiveAnyKernel {};
+
+TEST_F(TestAnyKernel, Basics) {
+  std::vector<std::string> chunked_input0 = {"[]", "[true]"};
+  std::vector<std::string> chunked_input1 = {"[true, true, null]", "[true, null]"};
+  std::vector<std::string> chunked_input2 = {"[false, false, false]", "[false]"};
+  std::vector<std::string> chunked_input3 = {"[false, null]", "[null, false]"};
+
+  this->AssertAnyIs("[false]", false);
+  this->AssertAnyIs("[true, false]", true);
+  this->AssertAnyIs("[null, null, null]", false);
+  this->AssertAnyIs("[false, false, false]", false);
+  this->AssertAnyIs("[false, false, false, null]", false);
+  this->AssertAnyIs("[true, null, true, true]", true);
+  this->AssertAnyIs("[false, null, false, true]", true);
+  this->AssertAnyIs("[true, null, false, true]", true);
+  this->AssertAnyIs(chunked_input0, true);
+  this->AssertAnyIs(chunked_input1, true);
+  this->AssertAnyIs(chunked_input2, false);
+  this->AssertAnyIs(chunked_input3, false);
+}
+
+struct AnyResult {
+  using c_type = typename BooleanType::c_type;
+
+  c_type any = false;
+};
+
+static AnyResult NaiveAny(const Array& array) {
+  using ArrayType = typename TypeTraits<BooleanType>::ArrayType;
+
+  AnyResult result;
+
+  const auto& array_numeric = reinterpret_cast<const ArrayType&>(array);

Review comment:
       Done!

##########
File path: cpp/src/arrow/compute/kernels/aggregate_test.cc
##########
@@ -726,6 +726,109 @@ TYPED_TEST(TestRandomNumericMinMaxKernel, RandomArrayMinMax) {
   }
 }
 
+//
+// Any
+//
+
+class TestPrimitiveAnyKernel : public ::testing::Test {
+  using Traits = TypeTraits<BooleanType>;
+  using ArrayType = typename Traits::ArrayType;
+  using c_type = typename BooleanType::c_type;
+  using ScalarType = typename Traits::ScalarType;
+
+ public:
+  void AssertAnyIs(const Datum& array, c_type expected) {
+    ASSERT_OK_AND_ASSIGN(Datum out, Any(array));
+    const ScalarType& out_any = out.scalar_as<ScalarType>();
+    const auto expected_any = static_cast<const ScalarType>(expected);
+    ASSERT_EQ(out_any, expected_any);
+  }
+
+  void AssertAnyIs(const std::string& json, c_type expected) {
+    auto array = ArrayFromJSON(type_singleton(), json);
+    AssertAnyIs(array, expected);
+  }
+
+  void AssertAnyIs(const std::vector<std::string>& json, c_type expected) {
+    auto array = ChunkedArrayFromJSON(type_singleton(), json);
+    AssertAnyIs(array, expected);
+  }
+
+  std::shared_ptr<DataType> type_singleton() { return Traits::type_singleton(); }
+};
+
+class TestAnyKernel : public TestPrimitiveAnyKernel {};
+
+TEST_F(TestAnyKernel, Basics) {
+  std::vector<std::string> chunked_input0 = {"[]", "[true]"};
+  std::vector<std::string> chunked_input1 = {"[true, true, null]", "[true, null]"};
+  std::vector<std::string> chunked_input2 = {"[false, false, false]", "[false]"};
+  std::vector<std::string> chunked_input3 = {"[false, null]", "[null, false]"};
+
+  this->AssertAnyIs("[false]", false);
+  this->AssertAnyIs("[true, false]", true);
+  this->AssertAnyIs("[null, null, null]", false);
+  this->AssertAnyIs("[false, false, false]", false);
+  this->AssertAnyIs("[false, false, false, null]", false);
+  this->AssertAnyIs("[true, null, true, true]", true);
+  this->AssertAnyIs("[false, null, false, true]", true);
+  this->AssertAnyIs("[true, null, false, true]", true);
+  this->AssertAnyIs(chunked_input0, true);
+  this->AssertAnyIs(chunked_input1, true);
+  this->AssertAnyIs(chunked_input2, false);
+  this->AssertAnyIs(chunked_input3, false);
+}
+
+struct AnyResult {
+  using c_type = typename BooleanType::c_type;
+
+  c_type any = false;
+};

Review comment:
       No - switched to plain bool variable in the new commit as you suggest below




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] wesm commented on a change in pull request #8294: ARROW-1846: [C++][Compute] Implement "any" reduction kernel for boolean data

Posted by GitBox <gi...@apache.org>.
wesm commented on a change in pull request #8294:
URL: https://github.com/apache/arrow/pull/8294#discussion_r496875601



##########
File path: cpp/src/arrow/compute/kernels/aggregate_basic_internal.h
##########
@@ -547,6 +547,36 @@ struct BooleanMinMaxImpl : public MinMaxImpl<BooleanType, SimdLevel> {
   }
 };
 
+template <SimdLevel::type SimdLevel>
+struct BooleanAnyImpl : public MinMaxImpl<BooleanType, SimdLevel> {
+  using StateType = MinMaxState<BooleanType, SimdLevel>;
+  using ArrayType = typename TypeTraits<BooleanType>::ArrayType;
+  using MinMaxImpl<BooleanType, SimdLevel>::MinMaxImpl;
+
+  void Consume(KernelContext*, const ExecBatch& batch) override {
+    // short-circuit if seen a True already
+    if (this->state.max == true) {
+      return;
+    }
+
+    ArrayType arr(batch[0].array());
+    const auto true_count = arr.true_count();
+    if (true_count > 0) {
+      this->state.max = true;
+    }
+  }
+
+  void Finalize(KernelContext*, Datum* out) override {
+    using ScalarType = typename TypeTraits<BooleanType>::ScalarType;

Review comment:
       Just use BooleanScalar here?

##########
File path: cpp/src/arrow/compute/kernels/aggregate_basic_internal.h
##########
@@ -547,6 +547,36 @@ struct BooleanMinMaxImpl : public MinMaxImpl<BooleanType, SimdLevel> {
   }
 };
 
+template <SimdLevel::type SimdLevel>
+struct BooleanAnyImpl : public MinMaxImpl<BooleanType, SimdLevel> {
+  using StateType = MinMaxState<BooleanType, SimdLevel>;
+  using ArrayType = typename TypeTraits<BooleanType>::ArrayType;
+  using MinMaxImpl<BooleanType, SimdLevel>::MinMaxImpl;
+
+  void Consume(KernelContext*, const ExecBatch& batch) override {
+    // short-circuit if seen a True already
+    if (this->state.max == true) {
+      return;
+    }
+
+    ArrayType arr(batch[0].array());
+    const auto true_count = arr.true_count();
+    if (true_count > 0) {
+      this->state.max = true;
+    }
+  }
+
+  void Finalize(KernelContext*, Datum* out) override {
+    using ScalarType = typename TypeTraits<BooleanType>::ScalarType;
+
+    if (this->state.max == true) {
+      out->value = std::make_shared<ScalarType>(true);
+    } else {
+      out->value = std::make_shared<ScalarType>(false);
+    }

Review comment:
       Collapse all this to
   
   `out->value = std::make_shared<BooleanScalar>(this->state.max)`

##########
File path: cpp/src/arrow/compute/kernels/aggregate_test.cc
##########
@@ -726,6 +726,109 @@ TYPED_TEST(TestRandomNumericMinMaxKernel, RandomArrayMinMax) {
   }
 }
 
+//
+// Any
+//
+
+class TestPrimitiveAnyKernel : public ::testing::Test {
+  using Traits = TypeTraits<BooleanType>;
+  using ArrayType = typename Traits::ArrayType;
+  using c_type = typename BooleanType::c_type;
+  using ScalarType = typename Traits::ScalarType;
+
+ public:
+  void AssertAnyIs(const Datum& array, c_type expected) {
+    ASSERT_OK_AND_ASSIGN(Datum out, Any(array));
+    const ScalarType& out_any = out.scalar_as<ScalarType>();
+    const auto expected_any = static_cast<const ScalarType>(expected);
+    ASSERT_EQ(out_any, expected_any);
+  }
+
+  void AssertAnyIs(const std::string& json, c_type expected) {
+    auto array = ArrayFromJSON(type_singleton(), json);
+    AssertAnyIs(array, expected);
+  }
+
+  void AssertAnyIs(const std::vector<std::string>& json, c_type expected) {
+    auto array = ChunkedArrayFromJSON(type_singleton(), json);
+    AssertAnyIs(array, expected);
+  }
+
+  std::shared_ptr<DataType> type_singleton() { return Traits::type_singleton(); }
+};
+
+class TestAnyKernel : public TestPrimitiveAnyKernel {};
+
+TEST_F(TestAnyKernel, Basics) {
+  std::vector<std::string> chunked_input0 = {"[]", "[true]"};
+  std::vector<std::string> chunked_input1 = {"[true, true, null]", "[true, null]"};
+  std::vector<std::string> chunked_input2 = {"[false, false, false]", "[false]"};
+  std::vector<std::string> chunked_input3 = {"[false, null]", "[null, false]"};
+
+  this->AssertAnyIs("[false]", false);
+  this->AssertAnyIs("[true, false]", true);
+  this->AssertAnyIs("[null, null, null]", false);
+  this->AssertAnyIs("[false, false, false]", false);
+  this->AssertAnyIs("[false, false, false, null]", false);
+  this->AssertAnyIs("[true, null, true, true]", true);
+  this->AssertAnyIs("[false, null, false, true]", true);
+  this->AssertAnyIs("[true, null, false, true]", true);
+  this->AssertAnyIs(chunked_input0, true);
+  this->AssertAnyIs(chunked_input1, true);
+  this->AssertAnyIs(chunked_input2, false);
+  this->AssertAnyIs(chunked_input3, false);
+}
+
+struct AnyResult {
+  using c_type = typename BooleanType::c_type;
+
+  c_type any = false;
+};
+
+static AnyResult NaiveAny(const Array& array) {

Review comment:
       maybe just `bool` here?

##########
File path: cpp/src/arrow/compute/kernels/aggregate_test.cc
##########
@@ -726,6 +726,109 @@ TYPED_TEST(TestRandomNumericMinMaxKernel, RandomArrayMinMax) {
   }
 }
 
+//
+// Any
+//
+
+class TestPrimitiveAnyKernel : public ::testing::Test {
+  using Traits = TypeTraits<BooleanType>;
+  using ArrayType = typename Traits::ArrayType;
+  using c_type = typename BooleanType::c_type;
+  using ScalarType = typename Traits::ScalarType;
+
+ public:
+  void AssertAnyIs(const Datum& array, c_type expected) {
+    ASSERT_OK_AND_ASSIGN(Datum out, Any(array));
+    const ScalarType& out_any = out.scalar_as<ScalarType>();
+    const auto expected_any = static_cast<const ScalarType>(expected);
+    ASSERT_EQ(out_any, expected_any);
+  }
+
+  void AssertAnyIs(const std::string& json, c_type expected) {
+    auto array = ArrayFromJSON(type_singleton(), json);
+    AssertAnyIs(array, expected);
+  }
+
+  void AssertAnyIs(const std::vector<std::string>& json, c_type expected) {
+    auto array = ChunkedArrayFromJSON(type_singleton(), json);
+    AssertAnyIs(array, expected);
+  }
+
+  std::shared_ptr<DataType> type_singleton() { return Traits::type_singleton(); }
+};
+
+class TestAnyKernel : public TestPrimitiveAnyKernel {};
+
+TEST_F(TestAnyKernel, Basics) {
+  std::vector<std::string> chunked_input0 = {"[]", "[true]"};
+  std::vector<std::string> chunked_input1 = {"[true, true, null]", "[true, null]"};
+  std::vector<std::string> chunked_input2 = {"[false, false, false]", "[false]"};
+  std::vector<std::string> chunked_input3 = {"[false, null]", "[null, false]"};
+
+  this->AssertAnyIs("[false]", false);
+  this->AssertAnyIs("[true, false]", true);
+  this->AssertAnyIs("[null, null, null]", false);
+  this->AssertAnyIs("[false, false, false]", false);
+  this->AssertAnyIs("[false, false, false, null]", false);
+  this->AssertAnyIs("[true, null, true, true]", true);
+  this->AssertAnyIs("[false, null, false, true]", true);
+  this->AssertAnyIs("[true, null, false, true]", true);
+  this->AssertAnyIs(chunked_input0, true);
+  this->AssertAnyIs(chunked_input1, true);
+  this->AssertAnyIs(chunked_input2, false);
+  this->AssertAnyIs(chunked_input3, false);
+}
+
+struct AnyResult {
+  using c_type = typename BooleanType::c_type;
+
+  c_type any = false;
+};
+
+static AnyResult NaiveAny(const Array& array) {
+  using ArrayType = typename TypeTraits<BooleanType>::ArrayType;
+
+  AnyResult result;
+
+  const auto& array_numeric = reinterpret_cast<const ArrayType&>(array);
+  const auto true_count = array_numeric.true_count();
+
+  if (true_count > 0) {
+    result.any = true;
+  }
+  return result;
+}
+
+void ValidateAny(const Array& array) {
+  using Traits = TypeTraits<BooleanType>;
+  using ScalarType = typename Traits::ScalarType;
+
+  ASSERT_OK_AND_ASSIGN(Datum out, Any(array));
+  const ScalarType& out_any = out.scalar_as<ScalarType>();
+
+  auto expected = NaiveAny(array);
+  const auto& expected_any = static_cast<const ScalarType>(expected.any);
+
+  ASSERT_EQ(out_any, expected_any);
+}
+
+class TestRandomBooleanAnyKernel : public ::testing::Test {};
+
+TEST_F(TestRandomBooleanAnyKernel, RandomArrayAny) {
+  auto rand = random::RandomArrayGenerator(0x8afc055);
+  // Test size up to 1<<11 (2048).
+  for (size_t i = 3; i < 12; i += 2) {
+    for (auto null_probability : {0.0, 0.01, 0.1, 0.5, 0.99, 1.0}) {
+      int64_t base_length = (1UL << i) + 2;
+      auto array = rand.Boolean(base_length, null_probability, null_probability);
+      for (auto length_adjust : {-2, -1, 0, 1, 2}) {
+        int64_t length = (1UL << i) + length_adjust;
+        ValidateAny(*array->Slice(0, length));
+      }
+    }
+  }
+}

Review comment:
       It's unclear how useful randomly generated cases are for Any by its nature

##########
File path: cpp/src/arrow/compute/kernels/aggregate_basic_internal.h
##########
@@ -547,6 +547,36 @@ struct BooleanMinMaxImpl : public MinMaxImpl<BooleanType, SimdLevel> {
   }
 };
 
+template <SimdLevel::type SimdLevel>
+struct BooleanAnyImpl : public MinMaxImpl<BooleanType, SimdLevel> {
+  using StateType = MinMaxState<BooleanType, SimdLevel>;

Review comment:
       Not used

##########
File path: cpp/src/arrow/compute/kernels/aggregate_test.cc
##########
@@ -726,6 +726,109 @@ TYPED_TEST(TestRandomNumericMinMaxKernel, RandomArrayMinMax) {
   }
 }
 
+//
+// Any
+//
+
+class TestPrimitiveAnyKernel : public ::testing::Test {
+  using Traits = TypeTraits<BooleanType>;
+  using ArrayType = typename Traits::ArrayType;
+  using c_type = typename BooleanType::c_type;
+  using ScalarType = typename Traits::ScalarType;

Review comment:
       These typedefs probably not needed, just use the Boolean* types within

##########
File path: cpp/src/arrow/compute/kernels/aggregate_basic_internal.h
##########
@@ -547,6 +547,36 @@ struct BooleanMinMaxImpl : public MinMaxImpl<BooleanType, SimdLevel> {
   }
 };
 
+template <SimdLevel::type SimdLevel>
+struct BooleanAnyImpl : public MinMaxImpl<BooleanType, SimdLevel> {
+  using StateType = MinMaxState<BooleanType, SimdLevel>;
+  using ArrayType = typename TypeTraits<BooleanType>::ArrayType;
+  using MinMaxImpl<BooleanType, SimdLevel>::MinMaxImpl;
+
+  void Consume(KernelContext*, const ExecBatch& batch) override {
+    // short-circuit if seen a True already
+    if (this->state.max == true) {
+      return;
+    }
+
+    ArrayType arr(batch[0].array());

Review comment:
       Just use `BooleanArray` here

##########
File path: cpp/src/arrow/compute/kernels/aggregate_test.cc
##########
@@ -726,6 +726,109 @@ TYPED_TEST(TestRandomNumericMinMaxKernel, RandomArrayMinMax) {
   }
 }
 
+//
+// Any
+//
+
+class TestPrimitiveAnyKernel : public ::testing::Test {
+  using Traits = TypeTraits<BooleanType>;
+  using ArrayType = typename Traits::ArrayType;
+  using c_type = typename BooleanType::c_type;
+  using ScalarType = typename Traits::ScalarType;
+
+ public:
+  void AssertAnyIs(const Datum& array, c_type expected) {
+    ASSERT_OK_AND_ASSIGN(Datum out, Any(array));
+    const ScalarType& out_any = out.scalar_as<ScalarType>();
+    const auto expected_any = static_cast<const ScalarType>(expected);
+    ASSERT_EQ(out_any, expected_any);
+  }
+
+  void AssertAnyIs(const std::string& json, c_type expected) {
+    auto array = ArrayFromJSON(type_singleton(), json);
+    AssertAnyIs(array, expected);
+  }
+
+  void AssertAnyIs(const std::vector<std::string>& json, c_type expected) {
+    auto array = ChunkedArrayFromJSON(type_singleton(), json);
+    AssertAnyIs(array, expected);
+  }
+
+  std::shared_ptr<DataType> type_singleton() { return Traits::type_singleton(); }
+};
+
+class TestAnyKernel : public TestPrimitiveAnyKernel {};
+
+TEST_F(TestAnyKernel, Basics) {
+  std::vector<std::string> chunked_input0 = {"[]", "[true]"};
+  std::vector<std::string> chunked_input1 = {"[true, true, null]", "[true, null]"};
+  std::vector<std::string> chunked_input2 = {"[false, false, false]", "[false]"};
+  std::vector<std::string> chunked_input3 = {"[false, null]", "[null, false]"};
+
+  this->AssertAnyIs("[false]", false);
+  this->AssertAnyIs("[true, false]", true);
+  this->AssertAnyIs("[null, null, null]", false);
+  this->AssertAnyIs("[false, false, false]", false);
+  this->AssertAnyIs("[false, false, false, null]", false);
+  this->AssertAnyIs("[true, null, true, true]", true);
+  this->AssertAnyIs("[false, null, false, true]", true);
+  this->AssertAnyIs("[true, null, false, true]", true);
+  this->AssertAnyIs(chunked_input0, true);
+  this->AssertAnyIs(chunked_input1, true);
+  this->AssertAnyIs(chunked_input2, false);
+  this->AssertAnyIs(chunked_input3, false);
+}
+
+struct AnyResult {
+  using c_type = typename BooleanType::c_type;
+
+  c_type any = false;
+};

Review comment:
       Is this needed?

##########
File path: cpp/src/arrow/compute/kernels/aggregate_basic_internal.h
##########
@@ -584,6 +614,33 @@ struct MinMaxInitState {
   }
 };
 
+template <SimdLevel::type SimdLevel>
+struct AnyInitState {
+  std::unique_ptr<KernelState> state;
+  KernelContext* ctx;
+  const DataType& in_type;
+  const std::shared_ptr<DataType>& out_type;
+  const MinMaxOptions& options;
+
+  AnyInitState(KernelContext* ctx, const DataType& in_type,
+               const std::shared_ptr<DataType>& out_type, const MinMaxOptions& options)
+      : ctx(ctx), in_type(in_type), out_type(out_type), options(options) {}
+
+  Status Visit(const DataType&) {
+    return Status::NotImplemented("No any kernel implemented");
+  }
+
+  Status Visit(const BooleanType&) {
+    state.reset(new BooleanAnyImpl<SimdLevel>(out_type, options));
+    return Status::OK();
+  }
+
+  std::unique_ptr<KernelState> Create() {
+    ctx->SetStatus(VisitTypeInline(in_type, this));
+    return std::move(state);
+  }
+};

Review comment:
       Is this really needed? Since there is only one type handled seems like you could omit all this and do something simpler in `AnyInit`

##########
File path: cpp/src/arrow/compute/kernels/aggregate_test.cc
##########
@@ -726,6 +726,109 @@ TYPED_TEST(TestRandomNumericMinMaxKernel, RandomArrayMinMax) {
   }
 }
 
+//
+// Any
+//
+
+class TestPrimitiveAnyKernel : public ::testing::Test {
+  using Traits = TypeTraits<BooleanType>;
+  using ArrayType = typename Traits::ArrayType;
+  using c_type = typename BooleanType::c_type;
+  using ScalarType = typename Traits::ScalarType;
+
+ public:
+  void AssertAnyIs(const Datum& array, c_type expected) {
+    ASSERT_OK_AND_ASSIGN(Datum out, Any(array));
+    const ScalarType& out_any = out.scalar_as<ScalarType>();
+    const auto expected_any = static_cast<const ScalarType>(expected);
+    ASSERT_EQ(out_any, expected_any);
+  }
+
+  void AssertAnyIs(const std::string& json, c_type expected) {
+    auto array = ArrayFromJSON(type_singleton(), json);
+    AssertAnyIs(array, expected);
+  }
+
+  void AssertAnyIs(const std::vector<std::string>& json, c_type expected) {
+    auto array = ChunkedArrayFromJSON(type_singleton(), json);
+    AssertAnyIs(array, expected);
+  }
+
+  std::shared_ptr<DataType> type_singleton() { return Traits::type_singleton(); }
+};
+
+class TestAnyKernel : public TestPrimitiveAnyKernel {};
+
+TEST_F(TestAnyKernel, Basics) {
+  std::vector<std::string> chunked_input0 = {"[]", "[true]"};
+  std::vector<std::string> chunked_input1 = {"[true, true, null]", "[true, null]"};
+  std::vector<std::string> chunked_input2 = {"[false, false, false]", "[false]"};
+  std::vector<std::string> chunked_input3 = {"[false, null]", "[null, false]"};
+
+  this->AssertAnyIs("[false]", false);
+  this->AssertAnyIs("[true, false]", true);
+  this->AssertAnyIs("[null, null, null]", false);
+  this->AssertAnyIs("[false, false, false]", false);
+  this->AssertAnyIs("[false, false, false, null]", false);
+  this->AssertAnyIs("[true, null, true, true]", true);
+  this->AssertAnyIs("[false, null, false, true]", true);
+  this->AssertAnyIs("[true, null, false, true]", true);
+  this->AssertAnyIs(chunked_input0, true);
+  this->AssertAnyIs(chunked_input1, true);
+  this->AssertAnyIs(chunked_input2, false);
+  this->AssertAnyIs(chunked_input3, false);
+}
+
+struct AnyResult {
+  using c_type = typename BooleanType::c_type;
+
+  c_type any = false;
+};
+
+static AnyResult NaiveAny(const Array& array) {
+  using ArrayType = typename TypeTraits<BooleanType>::ArrayType;
+
+  AnyResult result;
+
+  const auto& array_numeric = reinterpret_cast<const ArrayType&>(array);

Review comment:
       use `checked_cast<const BooleanArray&>` here




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorisvandenbossche commented on a change in pull request #8294: ARROW-1846: [C++][Compute] Implement "any" reduction kernel for boolean data

Posted by GitBox <gi...@apache.org>.
jorisvandenbossche commented on a change in pull request #8294:
URL: https://github.com/apache/arrow/pull/8294#discussion_r504113404



##########
File path: cpp/src/arrow/compute/api_aggregate.h
##########
@@ -142,6 +142,20 @@ Result<Datum> MinMax(const Datum& value,
                      const MinMaxOptions& options = MinMaxOptions::Defaults(),
                      ExecContext* ctx = NULLPTR);
 
+/// \brief Test whether any element in a boolean array evaluates to true.
+///
+/// This function returns true if any of the elements in the array evaluates
+/// to true and false otherwise. Null values are taken to evaluate to false.

Review comment:
       Ah, yes, you're correct that for this PR there is not yet a difference: you are only dealing with `any`, and it's only for `all` that there is a difference between "skipping nulls" and "treating nulls as False"
   
   So I would indeed update the docstring to simply state that nulls are skipped.
   
   For your example about `all` (I assume the code was meant to use `all` and not `any` ?), for the last line about Kleene logic I expect a return value of `null` (as the `ǹull` in the values can be both True or False, meaning that the result can be True of False, meaning it is unknown)




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] arw2019 commented on a change in pull request #8294: ARROW-1846: [C++][Compute] Implement "any" reduction kernel for boolean data

Posted by GitBox <gi...@apache.org>.
arw2019 commented on a change in pull request #8294:
URL: https://github.com/apache/arrow/pull/8294#discussion_r523059162



##########
File path: cpp/src/arrow/compute/api_aggregate.cc
##########
@@ -41,8 +41,12 @@ Result<Datum> MinMax(const Datum& value, const MinMaxOptions& options, ExecConte
   return CallFunction("min_max", {value}, &options, ctx);
 }
 
-Result<Datum> Mode(const Datum& value, const ModeOptions& options, ExecContext* ctx) {
-  return CallFunction("mode", {value}, &options, ctx);
+Result<Datum> Any(const Datum& value, ExecContext* ctx) {
+  return CallFunction("any", {value}, ctx);
+}
+
+Result<Datum> Mode(const Datum& value, ExecContext* ctx) {

Review comment:
       Thanks - yes, that's the problem!




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] arw2019 commented on a change in pull request #8294: ARROW-1846: [C++][Compute] Implement "any" reduction kernel for boolean data

Posted by GitBox <gi...@apache.org>.
arw2019 commented on a change in pull request #8294:
URL: https://github.com/apache/arrow/pull/8294#discussion_r504080604



##########
File path: cpp/src/arrow/compute/kernels/aggregate_basic_internal.h
##########
@@ -584,6 +614,33 @@ struct MinMaxInitState {
   }
 };
 
+template <SimdLevel::type SimdLevel>
+struct AnyInitState {
+  std::unique_ptr<KernelState> state;
+  KernelContext* ctx;
+  const DataType& in_type;
+  const std::shared_ptr<DataType>& out_type;
+  const MinMaxOptions& options;
+
+  AnyInitState(KernelContext* ctx, const DataType& in_type,
+               const std::shared_ptr<DataType>& out_type, const MinMaxOptions& options)
+      : ctx(ctx), in_type(in_type), out_type(out_type), options(options) {}
+
+  Status Visit(const DataType&) {
+    return Status::NotImplemented("No any kernel implemented");
+  }
+
+  Status Visit(const BooleanType&) {
+    state.reset(new BooleanAnyImpl<SimdLevel>(out_type, options));
+    return Status::OK();
+  }
+
+  std::unique_ptr<KernelState> Create() {
+    ctx->SetStatus(VisitTypeInline(in_type, this));
+    return std::move(state);
+  }
+};

Review comment:
       It's not. I've simplified it

##########
File path: cpp/src/arrow/compute/kernels/aggregate_basic_internal.h
##########
@@ -549,6 +549,28 @@ struct BooleanMinMaxImpl : public MinMaxImpl<BooleanType, SimdLevel> {
   }
 };
 
+template <SimdLevel::type SimdLevel>
+struct BooleanAnyImpl : public MinMaxImpl<BooleanType, SimdLevel> {
+  using MinMaxImpl<BooleanType, SimdLevel>::MinMaxImpl;
+
+  void Consume(KernelContext*, const ExecBatch& batch) override {
+    // short-circuit if seen a True already
+    if (this->state.max == true) {
+      return;
+    }
+
+    BooleanArray arr(batch[0].array());
+    const auto true_count = arr.true_count();

Review comment:
       This worked pretty much straight away. Thanks @pitrou!!!

##########
File path: cpp/src/arrow/compute/api_aggregate.h
##########
@@ -142,6 +142,20 @@ Result<Datum> MinMax(const Datum& value,
                      const MinMaxOptions& options = MinMaxOptions::Defaults(),
                      ExecContext* ctx = NULLPTR);
 
+/// \brief Test whether any element in a boolean array evaluates to true.
+///
+/// This function returns true if any of the elements in the array evaluates
+/// to true and false otherwise. Null values are taken to evaluate to false.

Review comment:
       I'm not sure about the desired behavior either (although if it was up to me I would want to shoot for consistency with existing kernels).
   
   That said I may need to improve the phrasing in the docstring. I think he current kernel behavior is what you describe: we skip nulls and return whether we saw any `True` values so perhaps it's better to just say that. I *think* that as is treating `null` as `false` or skipping is the same in this case, since neither evaluate to true.
   
   A bit off-topic, but for an `all` kernel (which could be nice to have) I think we'd want to have null handling options, so that users could switch between
   ```
   any([true, null]) = true # skip nulls
   any([true, null]) = false # null evaluates as false
   any([true, null]) = false # kleene logic (I think?)
   ```
   (PS Apologies for  not replying to your comment directly - I have opened ARROW-10291 to track that discussion )

##########
File path: cpp/src/arrow/compute/kernels/aggregate_basic_internal.h
##########
@@ -549,6 +549,28 @@ struct BooleanMinMaxImpl : public MinMaxImpl<BooleanType, SimdLevel> {
   }
 };
 
+template <SimdLevel::type SimdLevel>
+struct BooleanAnyImpl : public MinMaxImpl<BooleanType, SimdLevel> {

Review comment:
       It doesn't. I've rewritten it so it inherits from `ScalarAggregator` directly
   
   Also since it's no longer a template I moved it to `aggregate_basic.cc`

##########
File path: cpp/src/arrow/compute/kernels/aggregate_test.cc
##########
@@ -726,6 +726,57 @@ TYPED_TEST(TestRandomNumericMinMaxKernel, RandomArrayMinMax) {
   }
 }
 
+
+//
+// Any
+//
+
+class TestPrimitiveAnyKernel : public ::testing::Test {
+ public:
+  void AssertAnyIs(const Datum& array, bool expected) {
+    ASSERT_OK_AND_ASSIGN(Datum out, Any(array));
+    const BooleanScalar& out_any = out.scalar_as<BooleanScalar>();
+    const auto expected_any = static_cast<const BooleanScalar>(expected);
+    ASSERT_EQ(out_any, expected_any);
+  }
+
+  void AssertAnyIs(const std::string& json, bool expected) {
+    auto array = ArrayFromJSON(type_singleton(), json);
+    AssertAnyIs(array, expected);
+  }
+
+  void AssertAnyIs(const std::vector<std::string>& json, bool expected) {
+    auto array = ChunkedArrayFromJSON(type_singleton(), json);
+    AssertAnyIs(array, expected);
+  }
+
+  std::shared_ptr<DataType> type_singleton() {
+    return TypeTraits<BooleanType>::type_singleton();
+  }
+};
+
+class TestAnyKernel : public TestPrimitiveAnyKernel {};
+
+TEST_F(TestAnyKernel, Basics) {
+  std::vector<std::string> chunked_input0 = {"[]", "[true]"};
+  std::vector<std::string> chunked_input1 = {"[true, true, null]", "[true, null]"};
+  std::vector<std::string> chunked_input2 = {"[false, false, false]", "[false]"};
+  std::vector<std::string> chunked_input3 = {"[false, null]", "[null, false]"};
+
+  this->AssertAnyIs("[false]", false);

Review comment:
       Done




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] arw2019 commented on a change in pull request #8294: ARROW-1846: [C++][Compute] Implement "any" reduction kernel for boolean data

Posted by GitBox <gi...@apache.org>.
arw2019 commented on a change in pull request #8294:
URL: https://github.com/apache/arrow/pull/8294#discussion_r497678828



##########
File path: cpp/src/arrow/compute/kernels/aggregate_test.cc
##########
@@ -726,6 +726,109 @@ TYPED_TEST(TestRandomNumericMinMaxKernel, RandomArrayMinMax) {
   }
 }
 
+//
+// Any
+//
+
+class TestPrimitiveAnyKernel : public ::testing::Test {
+  using Traits = TypeTraits<BooleanType>;
+  using ArrayType = typename Traits::ArrayType;
+  using c_type = typename BooleanType::c_type;
+  using ScalarType = typename Traits::ScalarType;
+
+ public:
+  void AssertAnyIs(const Datum& array, c_type expected) {
+    ASSERT_OK_AND_ASSIGN(Datum out, Any(array));
+    const ScalarType& out_any = out.scalar_as<ScalarType>();
+    const auto expected_any = static_cast<const ScalarType>(expected);
+    ASSERT_EQ(out_any, expected_any);
+  }
+
+  void AssertAnyIs(const std::string& json, c_type expected) {
+    auto array = ArrayFromJSON(type_singleton(), json);
+    AssertAnyIs(array, expected);
+  }
+
+  void AssertAnyIs(const std::vector<std::string>& json, c_type expected) {
+    auto array = ChunkedArrayFromJSON(type_singleton(), json);
+    AssertAnyIs(array, expected);
+  }
+
+  std::shared_ptr<DataType> type_singleton() { return Traits::type_singleton(); }
+};
+
+class TestAnyKernel : public TestPrimitiveAnyKernel {};
+
+TEST_F(TestAnyKernel, Basics) {
+  std::vector<std::string> chunked_input0 = {"[]", "[true]"};
+  std::vector<std::string> chunked_input1 = {"[true, true, null]", "[true, null]"};
+  std::vector<std::string> chunked_input2 = {"[false, false, false]", "[false]"};
+  std::vector<std::string> chunked_input3 = {"[false, null]", "[null, false]"};
+
+  this->AssertAnyIs("[false]", false);
+  this->AssertAnyIs("[true, false]", true);
+  this->AssertAnyIs("[null, null, null]", false);
+  this->AssertAnyIs("[false, false, false]", false);
+  this->AssertAnyIs("[false, false, false, null]", false);
+  this->AssertAnyIs("[true, null, true, true]", true);
+  this->AssertAnyIs("[false, null, false, true]", true);
+  this->AssertAnyIs("[true, null, false, true]", true);
+  this->AssertAnyIs(chunked_input0, true);
+  this->AssertAnyIs(chunked_input1, true);
+  this->AssertAnyIs(chunked_input2, false);
+  this->AssertAnyIs(chunked_input3, false);
+}
+
+struct AnyResult {
+  using c_type = typename BooleanType::c_type;
+
+  c_type any = false;
+};
+
+static AnyResult NaiveAny(const Array& array) {
+  using ArrayType = typename TypeTraits<BooleanType>::ArrayType;
+
+  AnyResult result;
+
+  const auto& array_numeric = reinterpret_cast<const ArrayType&>(array);
+  const auto true_count = array_numeric.true_count();
+
+  if (true_count > 0) {
+    result.any = true;
+  }
+  return result;
+}
+
+void ValidateAny(const Array& array) {
+  using Traits = TypeTraits<BooleanType>;
+  using ScalarType = typename Traits::ScalarType;
+
+  ASSERT_OK_AND_ASSIGN(Datum out, Any(array));
+  const ScalarType& out_any = out.scalar_as<ScalarType>();
+
+  auto expected = NaiveAny(array);
+  const auto& expected_any = static_cast<const ScalarType>(expected.any);
+
+  ASSERT_EQ(out_any, expected_any);
+}
+
+class TestRandomBooleanAnyKernel : public ::testing::Test {};
+
+TEST_F(TestRandomBooleanAnyKernel, RandomArrayAny) {
+  auto rand = random::RandomArrayGenerator(0x8afc055);
+  // Test size up to 1<<11 (2048).
+  for (size_t i = 3; i < 12; i += 2) {
+    for (auto null_probability : {0.0, 0.01, 0.1, 0.5, 0.99, 1.0}) {
+      int64_t base_length = (1UL << i) + 2;
+      auto array = rand.Boolean(base_length, null_probability, null_probability);
+      for (auto length_adjust : {-2, -1, 0, 1, 2}) {
+        int64_t length = (1UL << i) + length_adjust;
+        ValidateAny(*array->Slice(0, length));
+      }
+    }
+  }
+}

Review comment:
       Done!




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorisvandenbossche commented on pull request #8294: ARROW-1846: [C++][Compute] Implement "any" reduction kernel for boolean data

Posted by GitBox <gi...@apache.org>.
jorisvandenbossche commented on pull request #8294:
URL: https://github.com/apache/arrow/pull/8294#issuecomment-701357437


   Just a high level remark (didn't yet look at the code), but I think the example you gave:
   
   ```
   In []: a = pa.array([True, None], type='bool') 
       ...:  
       ...: # option 1 
       ...: pc.any(a).as_py() is True 
       ...: pc.any_kleene(a).as_py() is None 
       ...:  
       ...: # option 2 
       ...: pc.any(null_handling='skip') is True 
       ...: pc.any(null_handling='emit_null') is None                                                     
   ```
   
   has a wrong output for the kleene version. With Kleene logic, also the second output would be True, as the array already contains a True, the missing value doesn't matter anymore. 
   
   Using Kleene logic or not is not the same as the skip/emit null handling. By default, if nulls are skipped, then it doesn't matter if you use Kleene logic or not, since there are no nulls to behave in certain ways. So only when not skipping nulls, you get a different behaviour: ``any([True, None], skipna=False)`` or `any_kleene([True, None], skipna=False)` would still both give True as result, since there is any True. But eg  ``any([False, None], skipna=False)`` woud give False (the missing being False) vs `any_kleene([False, None], skipna=False)` giving null as result.
   
   See also our discussions in pandas about this (https://github.com/pandas-dev/pandas/issues/29686; https://pandas.pydata.org/pandas-docs/stable/user_guide/boolean.html#kleene-logical-operations)


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] wesm commented on a change in pull request #8294: ARROW-1846: [C++][Compute] Implement "any" reduction kernel for boolean data

Posted by GitBox <gi...@apache.org>.
wesm commented on a change in pull request #8294:
URL: https://github.com/apache/arrow/pull/8294#discussion_r497568545



##########
File path: cpp/src/arrow/compute/kernels/aggregate_test.cc
##########
@@ -726,6 +726,109 @@ TYPED_TEST(TestRandomNumericMinMaxKernel, RandomArrayMinMax) {
   }
 }
 
+//
+// Any
+//
+
+class TestPrimitiveAnyKernel : public ::testing::Test {
+  using Traits = TypeTraits<BooleanType>;
+  using ArrayType = typename Traits::ArrayType;
+  using c_type = typename BooleanType::c_type;
+  using ScalarType = typename Traits::ScalarType;
+
+ public:
+  void AssertAnyIs(const Datum& array, c_type expected) {
+    ASSERT_OK_AND_ASSIGN(Datum out, Any(array));
+    const ScalarType& out_any = out.scalar_as<ScalarType>();
+    const auto expected_any = static_cast<const ScalarType>(expected);
+    ASSERT_EQ(out_any, expected_any);
+  }
+
+  void AssertAnyIs(const std::string& json, c_type expected) {
+    auto array = ArrayFromJSON(type_singleton(), json);
+    AssertAnyIs(array, expected);
+  }
+
+  void AssertAnyIs(const std::vector<std::string>& json, c_type expected) {
+    auto array = ChunkedArrayFromJSON(type_singleton(), json);
+    AssertAnyIs(array, expected);
+  }
+
+  std::shared_ptr<DataType> type_singleton() { return Traits::type_singleton(); }
+};
+
+class TestAnyKernel : public TestPrimitiveAnyKernel {};
+
+TEST_F(TestAnyKernel, Basics) {
+  std::vector<std::string> chunked_input0 = {"[]", "[true]"};
+  std::vector<std::string> chunked_input1 = {"[true, true, null]", "[true, null]"};
+  std::vector<std::string> chunked_input2 = {"[false, false, false]", "[false]"};
+  std::vector<std::string> chunked_input3 = {"[false, null]", "[null, false]"};
+
+  this->AssertAnyIs("[false]", false);
+  this->AssertAnyIs("[true, false]", true);
+  this->AssertAnyIs("[null, null, null]", false);
+  this->AssertAnyIs("[false, false, false]", false);
+  this->AssertAnyIs("[false, false, false, null]", false);
+  this->AssertAnyIs("[true, null, true, true]", true);
+  this->AssertAnyIs("[false, null, false, true]", true);
+  this->AssertAnyIs("[true, null, false, true]", true);
+  this->AssertAnyIs(chunked_input0, true);
+  this->AssertAnyIs(chunked_input1, true);
+  this->AssertAnyIs(chunked_input2, false);
+  this->AssertAnyIs(chunked_input3, false);
+}
+
+struct AnyResult {
+  using c_type = typename BooleanType::c_type;
+
+  c_type any = false;
+};
+
+static AnyResult NaiveAny(const Array& array) {
+  using ArrayType = typename TypeTraits<BooleanType>::ArrayType;
+
+  AnyResult result;
+
+  const auto& array_numeric = reinterpret_cast<const ArrayType&>(array);
+  const auto true_count = array_numeric.true_count();
+
+  if (true_count > 0) {
+    result.any = true;
+  }
+  return result;
+}
+
+void ValidateAny(const Array& array) {
+  using Traits = TypeTraits<BooleanType>;
+  using ScalarType = typename Traits::ScalarType;
+
+  ASSERT_OK_AND_ASSIGN(Datum out, Any(array));
+  const ScalarType& out_any = out.scalar_as<ScalarType>();
+
+  auto expected = NaiveAny(array);
+  const auto& expected_any = static_cast<const ScalarType>(expected.any);
+
+  ASSERT_EQ(out_any, expected_any);
+}
+
+class TestRandomBooleanAnyKernel : public ::testing::Test {};
+
+TEST_F(TestRandomBooleanAnyKernel, RandomArrayAny) {
+  auto rand = random::RandomArrayGenerator(0x8afc055);
+  // Test size up to 1<<11 (2048).
+  for (size_t i = 3; i < 12; i += 2) {
+    for (auto null_probability : {0.0, 0.01, 0.1, 0.5, 0.99, 1.0}) {
+      int64_t base_length = (1UL << i) + 2;
+      auto array = rand.Boolean(base_length, null_probability, null_probability);
+      for (auto length_adjust : {-2, -1, 0, 1, 2}) {
+        int64_t length = (1UL << i) + length_adjust;
+        ValidateAny(*array->Slice(0, length));
+      }
+    }
+  }
+}

Review comment:
       I think you can




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorisvandenbossche commented on a change in pull request #8294: ARROW-1846: [C++][Compute] Implement "any" reduction kernel for boolean data

Posted by GitBox <gi...@apache.org>.
jorisvandenbossche commented on a change in pull request #8294:
URL: https://github.com/apache/arrow/pull/8294#discussion_r503927482



##########
File path: cpp/src/arrow/compute/api_aggregate.h
##########
@@ -142,6 +142,20 @@ Result<Datum> MinMax(const Datum& value,
                      const MinMaxOptions& options = MinMaxOptions::Defaults(),
                      ExecContext* ctx = NULLPTR);
 
+/// \brief Test whether any element in a boolean array evaluates to true.
+///
+/// This function returns true if any of the elements in the array evaluates
+/// to true and false otherwise. Null values are taken to evaluate to false.

Review comment:
       Is this the behaviour we want? (the *"null values are taken to evaluate to false."*
   
   Any/all are of course not the most typical reductions (so I am also not fully sure about the desired behaviour), but, for other reductions we actually skip nulls. And skipping nulls is not the same as evaluating them to False
   
   (somewhat related to my comment at https://github.com/apache/arrow/pull/8294#issuecomment-701357437)
   




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorisvandenbossche commented on a change in pull request #8294: ARROW-1846: [C++][Compute] Implement "any" reduction kernel for boolean data

Posted by GitBox <gi...@apache.org>.
jorisvandenbossche commented on a change in pull request #8294:
URL: https://github.com/apache/arrow/pull/8294#discussion_r522942123



##########
File path: cpp/src/arrow/compute/api_aggregate.cc
##########
@@ -41,8 +41,12 @@ Result<Datum> MinMax(const Datum& value, const MinMaxOptions& options, ExecConte
   return CallFunction("min_max", {value}, &options, ctx);
 }
 
-Result<Datum> Mode(const Datum& value, const ModeOptions& options, ExecContext* ctx) {
-  return CallFunction("mode", {value}, &options, ctx);
+Result<Datum> Any(const Datum& value, ExecContext* ctx) {
+  return CallFunction("any", {value}, ctx);
+}
+
+Result<Datum> Mode(const Datum& value, ExecContext* ctx) {

Review comment:
       The ModeOptions are dropped here, might be a rebase error?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou commented on pull request #8294: ARROW-1846: [C++][Compute] Implement "any" reduction kernel for boolean data

Posted by GitBox <gi...@apache.org>.
pitrou commented on pull request #8294:
URL: https://github.com/apache/arrow/pull/8294#issuecomment-733069559


   I've rebased now.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorisvandenbossche commented on a change in pull request #8294: ARROW-1846: [C++][Compute] Implement "any" reduction kernel for boolean data

Posted by GitBox <gi...@apache.org>.
jorisvandenbossche commented on a change in pull request #8294:
URL: https://github.com/apache/arrow/pull/8294#discussion_r503927482



##########
File path: cpp/src/arrow/compute/api_aggregate.h
##########
@@ -142,6 +142,20 @@ Result<Datum> MinMax(const Datum& value,
                      const MinMaxOptions& options = MinMaxOptions::Defaults(),
                      ExecContext* ctx = NULLPTR);
 
+/// \brief Test whether any element in a boolean array evaluates to true.
+///
+/// This function returns true if any of the elements in the array evaluates
+/// to true and false otherwise. Null values are taken to evaluate to false.

Review comment:
       Is this the behaviour we want? (the *"null values are taken to evaluate to false"*)
   
   Any/all are of course not the most typical reductions (so I am also not fully sure about the desired behaviour), but, for other reductions we actually skip nulls. And skipping nulls is not the same as evaluating them to False
   
   (somewhat related to my comment at https://github.com/apache/arrow/pull/8294#issuecomment-701357437)
   




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorisvandenbossche commented on a change in pull request #8294: ARROW-1846: [C++][Compute] Implement "any" reduction kernel for boolean data

Posted by GitBox <gi...@apache.org>.
jorisvandenbossche commented on a change in pull request #8294:
URL: https://github.com/apache/arrow/pull/8294#discussion_r522941736



##########
File path: cpp/src/arrow/compute/api_aggregate.h
##########
@@ -154,7 +154,21 @@ Result<Datum> MinMax(const Datum& value,
                      const MinMaxOptions& options = MinMaxOptions::Defaults(),
                      ExecContext* ctx = NULLPTR);
 
-/// \brief Calculate the modal (most common) values of a numeric array
+/// \brief Test whether any element in a boolean array evaluates to true.
+///
+/// This function returns true if any of the elements in the array evaluates
+/// to true and false otherwise. Null values are skipped.
+///
+/// \param[in] value input datum, expecting a boolean array
+/// \param[in] ctx the function execution context, optional
+/// \return resulting datum as a BooleanScalar
+
+/// \since 2.0.0

Review comment:
       ```suggestion
   /// \since 3.0.0
   ```




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorisvandenbossche commented on a change in pull request #8294: ARROW-1846: [C++][Compute] Implement "any" reduction kernel for boolean data

Posted by GitBox <gi...@apache.org>.
jorisvandenbossche commented on a change in pull request #8294:
URL: https://github.com/apache/arrow/pull/8294#discussion_r504419732



##########
File path: cpp/src/arrow/compute/api_aggregate.h
##########
@@ -142,6 +142,20 @@ Result<Datum> MinMax(const Datum& value,
                      const MinMaxOptions& options = MinMaxOptions::Defaults(),
                      ExecContext* ctx = NULLPTR);
 
+/// \brief Test whether any element in a boolean array evaluates to true.
+///
+/// This function returns true if any of the elements in the array evaluates
+/// to true and false otherwise. Null values are taken to evaluate to false.

Review comment:
       > Sorry yes. The examples were for all. Just to make sure I understand, with kleene logic, we emit null if there's a null anywhere in the input
   
   No, we only emit null if the presence of the null (the fact that it could be either True or False) would influence the result:
   
   ```
   any_kleene([true, null]) = true
   any_kleene([false, null]) = null
   all_kleene([true, null]) = null
   all_kleene([false, null]) = false
   ```
   
   But, the above is only when *not* skipping nulls (because with the default of skipping, there is no difference with non-kleene logic)
   
   See the links I mentioned in https://github.com/apache/arrow/pull/8294#issuecomment-701357437, and also the Julia docs have a good explanation of three-valued (Kleene) logic: https://docs.julialang.org/en/v1/manual/missing/index.html#Logical-operators-1




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorisvandenbossche commented on a change in pull request #8294: ARROW-1846: [C++][Compute] Implement "any" reduction kernel for boolean data

Posted by GitBox <gi...@apache.org>.
jorisvandenbossche commented on a change in pull request #8294:
URL: https://github.com/apache/arrow/pull/8294#discussion_r504462669



##########
File path: cpp/src/arrow/compute/api_aggregate.h
##########
@@ -142,6 +142,20 @@ Result<Datum> MinMax(const Datum& value,
                      const MinMaxOptions& options = MinMaxOptions::Defaults(),
                      ExecContext* ctx = NULLPTR);
 
+/// \brief Test whether any element in a boolean array evaluates to true.
+///
+/// This function returns true if any of the elements in the array evaluates
+/// to true and false otherwise. Null values are taken to evaluate to false.

Review comment:
       BTW, the result you show (a null as result in all cases) is what I expect for the *non*-kleene version when not skipping nulls (I would expect that nulls propagate in that case, and not necessarily be interpreted as false)




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] github-actions[bot] commented on pull request #8294: ARROW-1846: [C++][Compute] Implement "any" reduction kernel for boolean data

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #8294:
URL: https://github.com/apache/arrow/pull/8294#issuecomment-700440254


   https://issues.apache.org/jira/browse/ARROW-1846


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou commented on a change in pull request #8294: ARROW-1846: [C++][Compute] Implement "any" reduction kernel for boolean data

Posted by GitBox <gi...@apache.org>.
pitrou commented on a change in pull request #8294:
URL: https://github.com/apache/arrow/pull/8294#discussion_r498344896



##########
File path: cpp/src/arrow/compute/kernels/aggregate_basic_internal.h
##########
@@ -549,6 +549,28 @@ struct BooleanMinMaxImpl : public MinMaxImpl<BooleanType, SimdLevel> {
   }
 };
 
+template <SimdLevel::type SimdLevel>
+struct BooleanAnyImpl : public MinMaxImpl<BooleanType, SimdLevel> {

Review comment:
       I don't really understand why this is inheriting from MinMax. Does it help reduce the code size in any way?

##########
File path: cpp/src/arrow/compute/kernels/aggregate_test.cc
##########
@@ -726,6 +726,57 @@ TYPED_TEST(TestRandomNumericMinMaxKernel, RandomArrayMinMax) {
   }
 }
 
+
+//
+// Any
+//
+
+class TestPrimitiveAnyKernel : public ::testing::Test {
+ public:
+  void AssertAnyIs(const Datum& array, bool expected) {
+    ASSERT_OK_AND_ASSIGN(Datum out, Any(array));
+    const BooleanScalar& out_any = out.scalar_as<BooleanScalar>();
+    const auto expected_any = static_cast<const BooleanScalar>(expected);
+    ASSERT_EQ(out_any, expected_any);
+  }
+
+  void AssertAnyIs(const std::string& json, bool expected) {
+    auto array = ArrayFromJSON(type_singleton(), json);
+    AssertAnyIs(array, expected);
+  }
+
+  void AssertAnyIs(const std::vector<std::string>& json, bool expected) {
+    auto array = ChunkedArrayFromJSON(type_singleton(), json);
+    AssertAnyIs(array, expected);
+  }
+
+  std::shared_ptr<DataType> type_singleton() {
+    return TypeTraits<BooleanType>::type_singleton();
+  }
+};
+
+class TestAnyKernel : public TestPrimitiveAnyKernel {};
+
+TEST_F(TestAnyKernel, Basics) {
+  std::vector<std::string> chunked_input0 = {"[]", "[true]"};
+  std::vector<std::string> chunked_input1 = {"[true, true, null]", "[true, null]"};
+  std::vector<std::string> chunked_input2 = {"[false, false, false]", "[false]"};
+  std::vector<std::string> chunked_input3 = {"[false, null]", "[null, false]"};
+
+  this->AssertAnyIs("[false]", false);

Review comment:
       Please also test with an empty array.

##########
File path: cpp/src/arrow/compute/kernels/aggregate_basic_internal.h
##########
@@ -549,6 +549,28 @@ struct BooleanMinMaxImpl : public MinMaxImpl<BooleanType, SimdLevel> {
   }
 };
 
+template <SimdLevel::type SimdLevel>
+struct BooleanAnyImpl : public MinMaxImpl<BooleanType, SimdLevel> {
+  using MinMaxImpl<BooleanType, SimdLevel>::MinMaxImpl;
+
+  void Consume(KernelContext*, const ExecBatch& batch) override {
+    // short-circuit if seen a True already
+    if (this->state.max == true) {
+      return;
+    }
+
+    BooleanArray arr(batch[0].array());
+    const auto true_count = arr.true_count();

Review comment:
       This forces counting the set bits in the whole array, but we're only interested in the presence of a single set bit, so we should be able to shortcut much more aggressively.
   
   You may try to use `OptionalBinaryBitBlockCounter` for that. Untested:
   ```c++
   const auto& data = *batch[0].array();
   OptionalBinaryBitBlockCounter counter(data.buffers[0], data.offset, data.buffers[1], data.offset, data.length);
   int64_t position = 0;
   while (position < data.length) {
     const auto block = counter.NextAndBlock();
     if (block.popcount > 0) {
       this->state.max = true;
       break;
     }
     position += block.length;
   }
   ```
   }
   




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] arw2019 commented on a change in pull request #8294: ARROW-1846: [C++][Compute] Implement "any" reduction kernel for boolean data

Posted by GitBox <gi...@apache.org>.
arw2019 commented on a change in pull request #8294:
URL: https://github.com/apache/arrow/pull/8294#discussion_r504128151



##########
File path: cpp/src/arrow/compute/api_aggregate.h
##########
@@ -142,6 +142,20 @@ Result<Datum> MinMax(const Datum& value,
                      const MinMaxOptions& options = MinMaxOptions::Defaults(),
                      ExecContext* ctx = NULLPTR);
 
+/// \brief Test whether any element in a boolean array evaluates to true.
+///
+/// This function returns true if any of the elements in the array evaluates
+/// to true and false otherwise. Null values are taken to evaluate to false.

Review comment:
       Docstring updated.
   
   Sorry yes. The examples were for `all`. Just to make sure I understand, with kleene logic, we emit null if there's a null anywhere in the input
   ```
   any_kleene([true, null]) = null
   any_kleene([false, null]) = null
   all_kleene([true, null]) = null
   all_kleene([false, null]) = null
   ```
   I opened ARROW-10301 re: `all` kernel




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org