You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/06/08 01:24:44 UTC

[GitHub] [arrow] rok opened a new pull request #10476: ARROW-12499: [C++][Compute] Add ScalarAggregateOptions to Any and All kernels

rok opened a new pull request #10476:
URL: https://github.com/apache/arrow/pull/10476


   This is to resolve [ARROW-12499](https://issues.apache.org/jira/browse/ARROW-12499).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] rok commented on a change in pull request #10476: ARROW-12499: [C++][Compute] Add ScalarAggregateOptions to Any and All kernels

Posted by GitBox <gi...@apache.org>.
rok commented on a change in pull request #10476:
URL: https://github.com/apache/arrow/pull/10476#discussion_r647625730



##########
File path: r/tests/testthat/test-compute-aggregate.R
##########
@@ -383,33 +383,30 @@ test_that("value_counts", {
 })
 
 test_that("any.Array and any.ChunkedArray", {
-  
   data <- c(1:10, NA, NA)
 
-  expect_vector_equal(any(input > 5), data)
+  expect_vector_equal(any(input > 5, na.rm = TRUE), data)

Review comment:
       It does indeed:
   ```R
   Failure (test-compute-aggregate.R:388:3): any.Array and any.ChunkedArray
   as.vector(x) not equal to `y`.
   'is.NA' value mismatch: 0 in current 1 in target
   Backtrace:
    1. expect_vector_equal(any(input > 5), data) test-compute-aggregate.R:388:2
    2. expect_as_vector(via_array, expected, ignore_attr, ...) helper-expectation.R:174:4
    3. testthat:::expect_fun(as.vector(x), y, ...) helper-expectation.R:24:2
   
   Failure (test-compute-aggregate.R:388:3): any.Array and any.ChunkedArray
   as.vector(x) not equal to `y`.
   'is.NA' value mismatch: 0 in current 1 in target
   Backtrace:
    1. expect_vector_equal(any(input > 5), data) test-compute-aggregate.R:388:2
    2. expect_as_vector(via_chunked, expected, ignore_attr, ...) helper-expectation.R:187:4
    3. testthat:::expect_fun(as.vector(x), y, ...) helper-expectation.R:24:2
   ```




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] rok commented on a change in pull request #10476: ARROW-12499: [C++][Compute] Add ScalarAggregateOptions to Any and All kernels

Posted by GitBox <gi...@apache.org>.
rok commented on a change in pull request #10476:
URL: https://github.com/apache/arrow/pull/10476#discussion_r651772086



##########
File path: docs/source/cpp/compute.rst
##########
@@ -213,18 +213,23 @@ Aggregations
 
 Notes:
 
-* \(1) Output is a ``{"min": input type, "max": input type}`` Struct.
+* \(1) If null values are taken into account by setting ScalarAggregateOptions
+  parameter skip_nulls = false then `Kleene logic`_ logic is applied.
 
-* \(2) Output is an array of ``{"mode": input type, "count": Int64}`` Struct.
+* \(2) Output is a ``{"min": input type, "max": input type}`` Struct.
+
+* \(3) Output is an array of ``{"mode": input type, "count": Int64}`` Struct.
   It contains the *N* most common elements in the input, in descending
   order, where *N* is given in :member:`ModeOptions::n`.
   If two values have the same count, the smallest one comes first.
   Note that the output can have less than *N* elements if the input has
   less than *N* distinct values.
 
-* \(3) Output is Float64 or input type, depending on QuantileOptions.
+* \(4) Output is Float64 or input type, depending on QuantileOptions.
+
+* \(5) Output is Int64, UInt64 or Float64, depending on the input type.
 
-* \(4) Output is Int64, UInt64 or Float64, depending on the input type.
+.. _Kleene logic: https://en.wikipedia.org/wiki/Three-valued_logic#Kleene_and_Priest_logics

Review comment:
       Removed. I was not sure how scopes in rst work :).




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorisvandenbossche commented on pull request #10476: ARROW-12499: [C++][Compute] Add ScalarAggregateOptions to Any and All kernels

Posted by GitBox <gi...@apache.org>.
jorisvandenbossche commented on pull request #10476:
URL: https://github.com/apache/arrow/pull/10476#issuecomment-867405359


   Can you take a look at my comment from above, which I think is not yet addressed:
   
   > Can you also update the docstrings which are defined in aggregate_basic.cc ?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorisvandenbossche commented on a change in pull request #10476: ARROW-12499: [C++][Compute] Add ScalarAggregateOptions to Any and All kernels

Posted by GitBox <gi...@apache.org>.
jorisvandenbossche commented on a change in pull request #10476:
URL: https://github.com/apache/arrow/pull/10476#discussion_r648385623



##########
File path: cpp/src/arrow/compute/kernels/aggregate_basic.cc
##########
@@ -166,32 +168,48 @@ struct BooleanAnyImpl : public ScalarAggregator {
   Status MergeFrom(KernelContext*, KernelState&& src) override {
     const auto& other = checked_cast<const BooleanAnyImpl&>(src);
     this->any |= other.any;
+    this->has_nulls |= other.has_nulls;
     return Status::OK();
   }
 
-  Status Finalize(KernelContext*, Datum* out) override {
-    out->value = std::make_shared<BooleanScalar>(this->any);
+  Status Finalize(KernelContext* ctx, Datum* out) override {
+    if (!options.skip_nulls && !this->any && this->has_nulls) {

Review comment:
       > Meanwhile Pandas' any is non-kleen.
   
   The pandas `any`/`all` methods were broken for object dtype (and since numpy doesn't support nulls in its boolean dtype, whenever you have missing values, you have object dtype), so best not to use that as a reference (see eg https://github.com/pandas-dev/pandas/issues/27709)
   




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorisvandenbossche commented on a change in pull request #10476: ARROW-12499: [C++][Compute] Add ScalarAggregateOptions to Any and All kernels

Posted by GitBox <gi...@apache.org>.
jorisvandenbossche commented on a change in pull request #10476:
URL: https://github.com/apache/arrow/pull/10476#discussion_r650962333



##########
File path: cpp/src/arrow/compute/kernels/aggregate_basic.cc
##########
@@ -166,32 +168,48 @@ struct BooleanAnyImpl : public ScalarAggregator {
   Status MergeFrom(KernelContext*, KernelState&& src) override {
     const auto& other = checked_cast<const BooleanAnyImpl&>(src);
     this->any |= other.any;
+    this->has_nulls |= other.has_nulls;
     return Status::OK();
   }
 
-  Status Finalize(KernelContext*, Datum* out) override {
-    out->value = std::make_shared<BooleanScalar>(this->any);
+  Status Finalize(KernelContext* ctx, Datum* out) override {
+    if (!options.skip_nulls && !this->any && this->has_nulls) {

Review comment:
       I am not sure there is a good use case for a non-kleene version, so I am fine with just documenting for now that the behaviour follows Kleene logic (so is the reducing version of and/or_kleene)




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] lidavidm commented on pull request #10476: ARROW-12499: [C++][Compute] Add ScalarAggregateOptions to Any and All kernels

Posted by GitBox <gi...@apache.org>.
lidavidm commented on pull request #10476:
URL: https://github.com/apache/arrow/pull/10476#issuecomment-856807294


   (Oh sorry, I missed that this is a draft.)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] rok commented on pull request #10476: ARROW-12499: [C++][Compute] Add ScalarAggregateOptions to Any and All kernels

Posted by GitBox <gi...@apache.org>.
rok commented on pull request #10476:
URL: https://github.com/apache/arrow/pull/10476#issuecomment-856823090


   > C++ parts look good to me.
   
   I'm thinking to if boolean checks would be better instead of counts?
   
   > Now that the kernel has options, should we add a Python convenience function in `pyarrow/compute.py`?
   
   Yeah definitely, next commit :)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] rok commented on a change in pull request #10476: ARROW-12499: [C++][Compute] Add ScalarAggregateOptions to Any and All kernels

Posted by GitBox <gi...@apache.org>.
rok commented on a change in pull request #10476:
URL: https://github.com/apache/arrow/pull/10476#discussion_r647881835



##########
File path: r/tests/testthat/test-compute-aggregate.R
##########
@@ -383,33 +383,30 @@ test_that("value_counts", {
 })
 
 test_that("any.Array and any.ChunkedArray", {
-  
   data <- c(1:10, NA, NA)
 
-  expect_vector_equal(any(input > 5), data)
+  expect_vector_equal(any(input > 5, na.rm = TRUE), data)

Review comment:
       Indeed @jonkeane it was a behavior issue. Should be fixed now.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] rok commented on a change in pull request #10476: ARROW-12499: [C++][Compute] Add ScalarAggregateOptions to Any and All kernels

Posted by GitBox <gi...@apache.org>.
rok commented on a change in pull request #10476:
URL: https://github.com/apache/arrow/pull/10476#discussion_r651771596



##########
File path: cpp/src/arrow/compute/api_aggregate.h
##########
@@ -201,30 +201,42 @@ Result<Datum> MinMax(
 /// \brief Test whether any element in a boolean array evaluates to true.
 ///
 /// This function returns true if any of the elements in the array evaluates
-/// to true and false otherwise. Null values are skipped.
+/// to true and false otherwise. Null values are ignored by default.
+/// If null values are taken into account by setting ScalarAggregateOptions
+/// parameter skip_nulls = false then kleen logic is applied.

Review comment:
       I think it KleeneOr. Added.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] rok commented on a change in pull request #10476: ARROW-12499: [C++][Compute] Add ScalarAggregateOptions to Any and All kernels

Posted by GitBox <gi...@apache.org>.
rok commented on a change in pull request #10476:
URL: https://github.com/apache/arrow/pull/10476#discussion_r647629868



##########
File path: r/tests/testthat/test-compute-aggregate.R
##########
@@ -383,33 +383,30 @@ test_that("value_counts", {
 })
 
 test_that("any.Array and any.ChunkedArray", {
-  
   data <- c(1:10, NA, NA)
 
-  expect_vector_equal(any(input > 5), data)
+  expect_vector_equal(any(input > 5, na.rm = TRUE), data)

Review comment:
       I'm not sure how to read what happens in `expect_vector_equal`.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jonkeane closed pull request #10476: ARROW-12499: [C++][Compute] Add ScalarAggregateOptions to Any and All kernels

Posted by GitBox <gi...@apache.org>.
jonkeane closed pull request #10476:
URL: https://github.com/apache/arrow/pull/10476


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] ianmcook commented on a change in pull request #10476: ARROW-12499: [C++][Compute] Add ScalarAggregateOptions to Any and All kernels

Posted by GitBox <gi...@apache.org>.
ianmcook commented on a change in pull request #10476:
URL: https://github.com/apache/arrow/pull/10476#discussion_r647632845



##########
File path: r/tests/testthat/test-compute-aggregate.R
##########
@@ -383,33 +383,30 @@ test_that("value_counts", {
 })
 
 test_that("any.Array and any.ChunkedArray", {
-  
   data <- c(1:10, NA, NA)
 
-  expect_vector_equal(any(input > 5), data)
+  expect_vector_equal(any(input > 5, na.rm = TRUE), data)
   expect_vector_equal(any(input < 1), data)
   expect_vector_equal(any(input < 1, na.rm = TRUE), data)
   
   data_logical <- c(TRUE, FALSE, TRUE, NA, FALSE)

Review comment:
       To properly test that `any()` is handling `na.rm` consistent with the way base R does, we should test with a vector that does not have any `TRUE` in it. Maybe let's add another few lines of test code here to verify that, and perserve the existing tests too:
   ```suggestion
     data_logical <- c(NA, FALSE, FALSE, NA, FALSE)
     expect_vector_equal(any(input), data_logical)
     expect_vector_equal(any(input, na.rm = FALSE), data_logical)
     expect_vector_equal(any(input, na.rm = TRUE), data_logical)
     
     data_logical <- c(TRUE, FALSE, TRUE, NA, FALSE)
   ```




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] rok commented on pull request #10476: ARROW-12499: [C++][Compute] Add ScalarAggregateOptions to Any and All kernels

Posted by GitBox <gi...@apache.org>.
rok commented on pull request #10476:
URL: https://github.com/apache/arrow/pull/10476#issuecomment-880017358


   Thanks all!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] lidavidm commented on pull request #10476: ARROW-12499: [C++][Compute] Add ScalarAggregateOptions to Any and All kernels

Posted by GitBox <gi...@apache.org>.
lidavidm commented on pull request #10476:
URL: https://github.com/apache/arrow/pull/10476#issuecomment-856825753


   > > C++ parts look good to me.
   > 
   > I'm thinking to if boolean checks would be better instead of counts?
   
   I debated whether to comment on this. I would slightly prefer the boolean check over the count but the intent is clear either way.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorisvandenbossche commented on a change in pull request #10476: ARROW-12499: [C++][Compute] Add ScalarAggregateOptions to Any and All kernels

Posted by GitBox <gi...@apache.org>.
jorisvandenbossche commented on a change in pull request #10476:
URL: https://github.com/apache/arrow/pull/10476#discussion_r648045031



##########
File path: cpp/src/arrow/compute/kernels/aggregate_basic.cc
##########
@@ -166,32 +168,48 @@ struct BooleanAnyImpl : public ScalarAggregator {
   Status MergeFrom(KernelContext*, KernelState&& src) override {
     const auto& other = checked_cast<const BooleanAnyImpl&>(src);
     this->any |= other.any;
+    this->has_nulls |= other.has_nulls;
     return Status::OK();
   }
 
-  Status Finalize(KernelContext*, Datum* out) override {
-    out->value = std::make_shared<BooleanScalar>(this->any);
+  Status Finalize(KernelContext* ctx, Datum* out) override {
+    if (!options.skip_nulls && !this->any && this->has_nulls) {

Review comment:
       What's the reason for the `&& !this->any` part? That seems to suggest "kleene logic"? But for the non-kleene version, I expect any null in the input to always give null for the output.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] rok commented on a change in pull request #10476: ARROW-12499: [C++][Compute] Add ScalarAggregateOptions to Any and All kernels

Posted by GitBox <gi...@apache.org>.
rok commented on a change in pull request #10476:
URL: https://github.com/apache/arrow/pull/10476#discussion_r650794819



##########
File path: cpp/src/arrow/compute/kernels/aggregate_basic.cc
##########
@@ -166,32 +168,48 @@ struct BooleanAnyImpl : public ScalarAggregator {
   Status MergeFrom(KernelContext*, KernelState&& src) override {
     const auto& other = checked_cast<const BooleanAnyImpl&>(src);
     this->any |= other.any;
+    this->has_nulls |= other.has_nulls;
     return Status::OK();
   }
 
-  Status Finalize(KernelContext*, Datum* out) override {
-    out->value = std::make_shared<BooleanScalar>(this->any);
+  Status Finalize(KernelContext* ctx, Datum* out) override {
+    if (!options.skip_nulls && !this->any && this->has_nulls) {

Review comment:
       Shall I:
   1. Document the behavior as kleen?
   2. Rather revert the change and first implement `al_kleen/any_kleen` ([ARROW-10291](https://issues.apache.org/jira/browse/ARROW-10291)) and then map R to those for `skip_nulls==False`?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] rok commented on pull request #10476: ARROW-12499: [C++][Compute] Add ScalarAggregateOptions to Any and All kernels

Posted by GitBox <gi...@apache.org>.
rok commented on pull request #10476:
URL: https://github.com/apache/arrow/pull/10476#issuecomment-867874847


   > Can you take a look at my comment from above, which I think is not yet addressed:
   > 
   > > Can you also update the docstrings which are defined in aggregate_basic.cc ?
   
   Oh sorry, I missed that one. Added in [last commit](https://github.com/apache/arrow/pull/10476/commits/c5976a94287cd5a0fd122fc4cbe56be93719b293).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] rok commented on a change in pull request #10476: ARROW-12499: [C++][Compute] Add ScalarAggregateOptions to Any and All kernels

Posted by GitBox <gi...@apache.org>.
rok commented on a change in pull request #10476:
URL: https://github.com/apache/arrow/pull/10476#discussion_r648212025



##########
File path: cpp/src/arrow/compute/kernels/aggregate_basic.cc
##########
@@ -166,32 +168,48 @@ struct BooleanAnyImpl : public ScalarAggregator {
   Status MergeFrom(KernelContext*, KernelState&& src) override {
     const auto& other = checked_cast<const BooleanAnyImpl&>(src);
     this->any |= other.any;
+    this->has_nulls |= other.has_nulls;
     return Status::OK();
   }
 
-  Status Finalize(KernelContext*, Datum* out) override {
-    out->value = std::make_shared<BooleanScalar>(this->any);
+  Status Finalize(KernelContext* ctx, Datum* out) override {
+    if (!options.skip_nulls && !this->any && this->has_nulls) {

Review comment:
       Indeed this is to to make the PR "kleen" to match [R behavior](https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/any). Meanwhile [Pandas' any is non-kleen](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.any.html).
   
   ```
   >>> import pandas as pd
   >>> pd.Series([None, None]).any(skipna=True)
   False
   >>> pd.Series([None, None]).any(skipna=False)
   >>> 
   ```
   
   We have three options IMO:
   - Revert to non-kleene c++ behaviour and add a small fix to R
   - Add kleen any/all kernels and route in R depending on flags
   - Keep c++ as is and add a fix to Python (that could introduce a lot of new unvanted logic)




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] github-actions[bot] commented on pull request #10476: ARROW-12499: [C++][Compute] Add ScalarAggregateOptions to Any and All kernels

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #10476:
URL: https://github.com/apache/arrow/pull/10476#issuecomment-856368854


   https://issues.apache.org/jira/browse/ARROW-12499


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] rok commented on a change in pull request #10476: ARROW-12499: [C++][Compute] Add ScalarAggregateOptions to Any and All kernels

Posted by GitBox <gi...@apache.org>.
rok commented on a change in pull request #10476:
URL: https://github.com/apache/arrow/pull/10476#discussion_r651318115



##########
File path: cpp/src/arrow/compute/kernels/aggregate_basic.cc
##########
@@ -166,32 +168,48 @@ struct BooleanAnyImpl : public ScalarAggregator {
   Status MergeFrom(KernelContext*, KernelState&& src) override {
     const auto& other = checked_cast<const BooleanAnyImpl&>(src);
     this->any |= other.any;
+    this->has_nulls |= other.has_nulls;
     return Status::OK();
   }
 
-  Status Finalize(KernelContext*, Datum* out) override {
-    out->value = std::make_shared<BooleanScalar>(this->any);
+  Status Finalize(KernelContext* ctx, Datum* out) override {
+    if (!options.skip_nulls && !this->any && this->has_nulls) {

Review comment:
       I documented the new behavior in [compute.rst](https://github.com/apache/arrow/pull/10476/files#diff-ce5b94577014735990903d3d03bd4ea4b8c8e6d32f5227592e60b7dd6a912d59) and [api_aggregate.h](https://github.com/apache/arrow/pull/10476/files#diff-694a635f185cf0d22dd6eefd91acedb48ed0bf71b15fd591aa6e7c508ba4cb09).




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorisvandenbossche commented on a change in pull request #10476: ARROW-12499: [C++][Compute] Add ScalarAggregateOptions to Any and All kernels

Posted by GitBox <gi...@apache.org>.
jorisvandenbossche commented on a change in pull request #10476:
URL: https://github.com/apache/arrow/pull/10476#discussion_r648385623



##########
File path: cpp/src/arrow/compute/kernels/aggregate_basic.cc
##########
@@ -166,32 +168,48 @@ struct BooleanAnyImpl : public ScalarAggregator {
   Status MergeFrom(KernelContext*, KernelState&& src) override {
     const auto& other = checked_cast<const BooleanAnyImpl&>(src);
     this->any |= other.any;
+    this->has_nulls |= other.has_nulls;
     return Status::OK();
   }
 
-  Status Finalize(KernelContext*, Datum* out) override {
-    out->value = std::make_shared<BooleanScalar>(this->any);
+  Status Finalize(KernelContext* ctx, Datum* out) override {
+    if (!options.skip_nulls && !this->any && this->has_nulls) {

Review comment:
       > Meanwhile Pandas' any is non-kleen.
   
   The pandas `any`/`all` methods were broken for object dtype (and since numpy doesn't support nulls in its boolean dtype, whenever you have missing values, you have object dtype), so best not to use that as a reference (see eg https://github.com/pandas-dev/pandas/issues/27709)
   
   For the new nullable boolean dtype in pandas, the any/all methods also use kleene logic like in R.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] rok commented on a change in pull request #10476: ARROW-12499: [C++][Compute] Add ScalarAggregateOptions to Any and All kernels

Posted by GitBox <gi...@apache.org>.
rok commented on a change in pull request #10476:
URL: https://github.com/apache/arrow/pull/10476#discussion_r648471718



##########
File path: cpp/src/arrow/compute/kernels/aggregate_basic.cc
##########
@@ -166,32 +168,48 @@ struct BooleanAnyImpl : public ScalarAggregator {
   Status MergeFrom(KernelContext*, KernelState&& src) override {
     const auto& other = checked_cast<const BooleanAnyImpl&>(src);
     this->any |= other.any;
+    this->has_nulls |= other.has_nulls;
     return Status::OK();
   }
 
-  Status Finalize(KernelContext*, Datum* out) override {
-    out->value = std::make_shared<BooleanScalar>(this->any);
+  Status Finalize(KernelContext* ctx, Datum* out) override {
+    if (!options.skip_nulls && !this->any && this->has_nulls) {

Review comment:
       So how about then just making this kernel "kleene" and just document that fact?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jonkeane commented on a change in pull request #10476: ARROW-12499: [C++][Compute] Add ScalarAggregateOptions to Any and All kernels

Posted by GitBox <gi...@apache.org>.
jonkeane commented on a change in pull request #10476:
URL: https://github.com/apache/arrow/pull/10476#discussion_r647622562



##########
File path: r/tests/testthat/test-compute-aggregate.R
##########
@@ -383,33 +383,30 @@ test_that("value_counts", {
 })
 
 test_that("any.Array and any.ChunkedArray", {
-  
   data <- c(1:10, NA, NA)
 
-  expect_vector_equal(any(input > 5), data)
+  expect_vector_equal(any(input > 5, na.rm = TRUE), data)
   expect_vector_equal(any(input < 1), data)
   expect_vector_equal(any(input < 1, na.rm = TRUE), data)
   
   data_logical <- c(TRUE, FALSE, TRUE, NA, FALSE)
   
-  expect_vector_equal(any(input), data_logical)
+  expect_vector_equal(any(input, na.rm = TRUE), data_logical)

Review comment:
       Same question as above, this line and the line just below look like they are not testing the exact same thing, yeah? Am I missing something?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] rok commented on pull request #10476: ARROW-12499: [C++][Compute] Add ScalarAggregateOptions to Any and All kernels

Posted by GitBox <gi...@apache.org>.
rok commented on pull request #10476:
URL: https://github.com/apache/arrow/pull/10476#issuecomment-856922943


   @lidavidm added the python test and switched from `null` count to `has_nulls` boolean. Also added a Python test so this should be good to review now.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jonkeane commented on a change in pull request #10476: ARROW-12499: [C++][Compute] Add ScalarAggregateOptions to Any and All kernels

Posted by GitBox <gi...@apache.org>.
jonkeane commented on a change in pull request #10476:
URL: https://github.com/apache/arrow/pull/10476#discussion_r647637941



##########
File path: r/tests/testthat/test-compute-aggregate.R
##########
@@ -383,33 +383,30 @@ test_that("value_counts", {
 })
 
 test_that("any.Array and any.ChunkedArray", {
-  
   data <- c(1:10, NA, NA)
 
-  expect_vector_equal(any(input > 5), data)
+  expect_vector_equal(any(input > 5, na.rm = TRUE), data)

Review comment:
       So the errors you're seeing indicate that we've got a mismatch in behavior. I see that Ian has commented below about more robust tests + how to match the behavior (but I'm happy to add to it if that's unclear!)




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] rok commented on pull request #10476: ARROW-12499: [C++][Compute] Add ScalarAggregateOptions to Any and All kernels

Posted by GitBox <gi...@apache.org>.
rok commented on pull request #10476:
URL: https://github.com/apache/arrow/pull/10476#issuecomment-857293373


   @lidavidm - I've introduced some small changes to c++ to enable desired behaviour so perhaps re-review is in order. Changes are all in [this commit.](https://github.com/apache/arrow/pull/10476/commits/69763766a7b44c66c8f9cce09845b3e9d1b6e218)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] lidavidm commented on a change in pull request #10476: ARROW-12499: [C++][Compute] Add ScalarAggregateOptions to Any and All kernels

Posted by GitBox <gi...@apache.org>.
lidavidm commented on a change in pull request #10476:
URL: https://github.com/apache/arrow/pull/10476#discussion_r647616702



##########
File path: python/pyarrow/tests/test_compute.py
##########
@@ -494,30 +494,40 @@ def test_min_max():
 
 def test_any():
     # ARROW-1846
+
+    options = pc.ScalarAggregateOptions(skip_nulls=False)

Review comment:
       Hmm, I had thought that generally, for kernels with options, we add a wrapper in pyarrow/compute.py. But now that I look, this isn't true for many of the kernels. So having Python tests is enough.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] rok commented on pull request #10476: ARROW-12499: [C++][Compute] Add ScalarAggregateOptions to Any and All kernels

Posted by GitBox <gi...@apache.org>.
rok commented on pull request #10476:
URL: https://github.com/apache/arrow/pull/10476#issuecomment-856925764


   Hey @thisisnic, @jonkeane, could you take a look at the [R part](https://github.com/apache/arrow/pull/10476/files#diff-9d33f036186210a1076f3adc590e0cc60a0b65556cefdbd15e6d26aa105f7d76) of this PR? It would be great to have an extra check :).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] rok commented on pull request #10476: ARROW-12499: [C++][Compute] Add ScalarAggregateOptions to Any and All kernels

Posted by GitBox <gi...@apache.org>.
rok commented on pull request #10476:
URL: https://github.com/apache/arrow/pull/10476#issuecomment-878128612


   @jorisvandenbossche @jonkeane @ianmcook Does this need another round of reviews?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jonkeane commented on a change in pull request #10476: ARROW-12499: [C++][Compute] Add ScalarAggregateOptions to Any and All kernels

Posted by GitBox <gi...@apache.org>.
jonkeane commented on a change in pull request #10476:
URL: https://github.com/apache/arrow/pull/10476#discussion_r647637242



##########
File path: r/tests/testthat/test-compute-aggregate.R
##########
@@ -383,33 +383,30 @@ test_that("value_counts", {
 })
 
 test_that("any.Array and any.ChunkedArray", {
-  
   data <- c(1:10, NA, NA)
 
-  expect_vector_equal(any(input > 5), data)
+  expect_vector_equal(any(input > 5, na.rm = TRUE), data)

Review comment:
       [`expect_vector_equal` takes an expression and a vector](https://github.com/apache/arrow/blob/ed4f79c77e5d34d5996fa287df0934dd2f93a12a/r/tests/testthat/helper-expectation.R#L159-L195) and evaluates the expression in base R, then converts the vector converts it into an array and evaluates the expression again (which typically will call arrow compute functions), and finally converts the vector to a chunked array and does the same.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] rok commented on pull request #10476: ARROW-12499: [C++][Compute] Add ScalarAggregateOptions to Any and All kernels

Posted by GitBox <gi...@apache.org>.
rok commented on pull request #10476:
URL: https://github.com/apache/arrow/pull/10476#issuecomment-871350397


   Ping :)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] ianmcook commented on a change in pull request #10476: ARROW-12499: [C++][Compute] Add ScalarAggregateOptions to Any and All kernels

Posted by GitBox <gi...@apache.org>.
ianmcook commented on a change in pull request #10476:
URL: https://github.com/apache/arrow/pull/10476#discussion_r647630733



##########
File path: r/tests/testthat/test-compute-aggregate.R
##########
@@ -383,33 +383,30 @@ test_that("value_counts", {
 })
 
 test_that("any.Array and any.ChunkedArray", {
-  
   data <- c(1:10, NA, NA)
 
-  expect_vector_equal(any(input > 5), data)
+  expect_vector_equal(any(input > 5, na.rm = TRUE), data)
   expect_vector_equal(any(input < 1), data)
   expect_vector_equal(any(input < 1, na.rm = TRUE), data)
   
   data_logical <- c(TRUE, FALSE, TRUE, NA, FALSE)
   
-  expect_vector_equal(any(input), data_logical)
+  expect_vector_equal(any(input, na.rm = TRUE), data_logical)

Review comment:
       This line and the one below are now identical. Let's please test all three cases (`na.rm` not specified, set to `FALSE`, and set to `TRUE`:
   ```suggestion
     expect_vector_equal(any(input), data_logical)
     expect_vector_equal(any(input, na.rm = FALSE), data_logical)
   ```




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] ianmcook commented on a change in pull request #10476: ARROW-12499: [C++][Compute] Add ScalarAggregateOptions to Any and All kernels

Posted by GitBox <gi...@apache.org>.
ianmcook commented on a change in pull request #10476:
URL: https://github.com/apache/arrow/pull/10476#discussion_r647632845



##########
File path: r/tests/testthat/test-compute-aggregate.R
##########
@@ -383,33 +383,30 @@ test_that("value_counts", {
 })
 
 test_that("any.Array and any.ChunkedArray", {
-  
   data <- c(1:10, NA, NA)
 
-  expect_vector_equal(any(input > 5), data)
+  expect_vector_equal(any(input > 5, na.rm = TRUE), data)
   expect_vector_equal(any(input < 1), data)
   expect_vector_equal(any(input < 1, na.rm = TRUE), data)
   
   data_logical <- c(TRUE, FALSE, TRUE, NA, FALSE)

Review comment:
       To properly test that `any()` is handling `na.rm` consistent with the way base R does, we should test with a vector that does not have any `TRUE` in it. Maybe let's add another few lines of test code here to verify that, and perserve the existing tests too:
   ```suggestion
     data_logical <- c(TRUE, FALSE, TRUE, NA, FALSE)
     expect_vector_equal(any(input), data_logical)
     expect_vector_equal(any(input, na.rm = FALSE), data_logical)
     expect_vector_equal(any(input, na.rm = TRUE), data_logical)
     
     data_logical <- c(TRUE, FALSE, TRUE, NA, FALSE)
   ```




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jonkeane commented on a change in pull request #10476: ARROW-12499: [C++][Compute] Add ScalarAggregateOptions to Any and All kernels

Posted by GitBox <gi...@apache.org>.
jonkeane commented on a change in pull request #10476:
URL: https://github.com/apache/arrow/pull/10476#discussion_r647622152



##########
File path: r/tests/testthat/test-compute-aggregate.R
##########
@@ -383,33 +383,30 @@ test_that("value_counts", {
 })
 
 test_that("any.Array and any.ChunkedArray", {
-  
   data <- c(1:10, NA, NA)
 
-  expect_vector_equal(any(input > 5), data)
+  expect_vector_equal(any(input > 5, na.rm = TRUE), data)

Review comment:
       Does this test fail without the `, na.rm = TRUE`?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorisvandenbossche commented on a change in pull request #10476: ARROW-12499: [C++][Compute] Add ScalarAggregateOptions to Any and All kernels

Posted by GitBox <gi...@apache.org>.
jorisvandenbossche commented on a change in pull request #10476:
URL: https://github.com/apache/arrow/pull/10476#discussion_r651808497



##########
File path: cpp/src/arrow/compute/api_aggregate.h
##########
@@ -201,30 +201,44 @@ Result<Datum> MinMax(
 /// \brief Test whether any element in a boolean array evaluates to true.
 ///
 /// This function returns true if any of the elements in the array evaluates
-/// to true and false otherwise. Null values are skipped.
+/// to true and false otherwise. Null values are ignored by default.
+/// If null values are taken into account by setting ScalarAggregateOptions
+/// parameter skip_nulls = false then Kleene logic is used.
+/// See KleeneOr for more details on Kleene logic.
 ///
 /// \param[in] value input datum, expecting a boolean array
+/// \param[in] options see ScalarAggregateOptions for more information
 /// \param[in] ctx the function execution context, optional
 /// \return resulting datum as a BooleanScalar
 ///
 /// \since 3.0.0
 /// \note API not yet finalized
 ARROW_EXPORT
-Result<Datum> Any(const Datum& value, ExecContext* ctx = NULLPTR);
+Result<Datum> Any(
+    const Datum& value,
+    const ScalarAggregateOptions& options = ScalarAggregateOptions::Defaults(),
+    ExecContext* ctx = NULLPTR);
 
 /// \brief Test whether all elements in a boolean array evaluate to true.
 ///
 /// This function returns true if all of the elements in the array evaluate
-/// to true and false otherwise. Null values are skipped.
+/// to true and false otherwise. Null values are ignored by default.
+/// If null values are taken into account by setting ScalarAggregateOptions
+/// parameter skip_nulls = false then Kleene logic is used.
+/// See KleeneOr for more details on Kleene logic.

Review comment:
       ```suggestion
   /// See KleeneAnd for more details on Kleene logic.
   ```




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org