You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/05/10 06:53:49 UTC

[GitHub] [arrow] cyb70289 commented on a change in pull request #9758: ARROW-9054: [C++] Add ScalarAggregateOptions

cyb70289 commented on a change in pull request #9758:
URL: https://github.com/apache/arrow/pull/9758#discussion_r629076290



##########
File path: cpp/src/arrow/compute/kernels/aggregate_basic_internal.h
##########
@@ -256,8 +262,8 @@ struct MinMaxImpl : public ScalarAggregator {
     using ScalarType = typename TypeTraits<ArrowType>::ScalarType;
 
     std::vector<std::shared_ptr<Scalar>> values;
-    if (!state.has_values ||
-        (state.has_nulls && options.null_handling == MinMaxOptions::EMIT_NULL)) {
+    if (options.min_count > 0 &&

Review comment:
       Current code returns maximal numerical limit for `min([], min_count=0)`, this is strange.
   I think `min_count = 0` is not a valid case for min_max, we can verify the options and fail if it's 0.
   Further more, `min_count` here is only checked for 0 or non 0, so basically it's useless if 0 is invalid. We can simply ignore it. Or improve min_max kernel (count valid values) as sum kernel does.

##########
File path: cpp/src/arrow/compute/kernels/aggregate_basic.cc
##########
@@ -75,48 +75,55 @@ struct CountImpl : public ScalarAggregator {
 
   Status Finalize(KernelContext* ctx, Datum* out) override {
     const auto& state = checked_cast<const CountImpl&>(*ctx->state());
-    switch (state.options.count_mode) {
-      case CountOptions::COUNT_NON_NULL:
-        *out = Datum(state.non_nulls);
-        break;
-      case CountOptions::COUNT_NULL:
-        *out = Datum(state.nulls);
-        break;
-      default:
-        return Status::Invalid("Unknown CountOptions encountered");
+    if (state.options.skip_nulls) {
+      *out = Datum(state.non_nulls);
+    } else {
+      *out = Datum(state.nulls);

Review comment:
       `skip_nulls = true` matches `COUNT_NON_NULL`, the meaning is clear.
   
   `skip_nulls = false` is a bit confusing. E.g., `count(list, skip_nulls=False)` should return the total length including nulls, but it actually returns only #nulls.
   I'm okay to accept current code, just wondering if there's better way.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org