You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2020/09/25 04:29:42 UTC

[GitHub] [arrow] cyb70289 opened a new pull request #8269: ARROW-10070: [C++][Compute] Implement stdev aggregate kernel

cyb70289 opened a new pull request #8269:
URL: https://github.com/apache/arrow/pull/8269


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou commented on a change in pull request #8269: ARROW-10070: [C++][Compute] Implement var and std aggregate kernel

Posted by GitBox <gi...@apache.org>.
pitrou commented on a change in pull request #8269:
URL: https://github.com/apache/arrow/pull/8269#discussion_r496679885



##########
File path: cpp/src/arrow/compute/kernels/aggregate_var_std.cc
##########
@@ -0,0 +1,200 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include "arrow/compute/kernels/aggregate_basic_internal.h"
+
+namespace arrow {
+namespace compute {
+namespace aggregate {
+
+namespace {
+
+// Based on https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance
+template <typename ArrowType>
+struct VarStdState {
+  using ArrayType = typename TypeTraits<ArrowType>::ArrayType;
+  using c_type = typename ArrowType::c_type;
+  using ThisType = VarStdState<ArrowType>;
+
+  // Calculate variance of one chunk with `two pass algorithm`
+  // Always use `double` to calculate variance for any array type
+  void Consume(const ArrayType& array) {
+    int64_t count = array.length() - array.null_count();
+    if (count == 0) {
+      return;
+    }
+
+    double sum = 0;
+    VisitArrayDataInline<ArrowType>(
+        *array.data(), [&sum](c_type value) { sum += static_cast<double>(value); },
+        []() {});
+
+    double mean = sum / count, m2 = 0;
+    VisitArrayDataInline<ArrowType>(
+        *array.data(),
+        [mean, &m2](c_type value) {
+          double v = static_cast<double>(value);
+          m2 += (v - mean) * (v - mean);
+        },
+        []() {});
+
+    this->count = count;
+    this->sum = sum;
+    this->m2 = m2;
+  }
+
+  // Combine variance from two chunks
+  void MergeFrom(const ThisType& state) {
+    if (state.count == 0) {
+      return;
+    }
+    if (this->count == 0) {
+      this->count = state.count;
+      this->sum = state.sum;
+      this->m2 = state.m2;
+      return;
+    }
+    double delta = this->sum / this->count - state.sum / state.count;
+    this->m2 += state.m2 +
+                delta * delta * this->count * state.count / (this->count + state.count);

Review comment:
       Hmm... can you document the source for this formula? This doesn't seem to correspond:
   https://www.emathzone.com/tutorials/basic-statistics/combined-variance.html
   




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] github-actions[bot] commented on pull request #8269: ARROW-10070: [C++][Compute] Implement stdev aggregate kernel

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #8269:
URL: https://github.com/apache/arrow/pull/8269#issuecomment-698714057


   https://issues.apache.org/jira/browse/ARROW-10070


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] kszucs edited a comment on pull request #8269: ARROW-10070: [C++][Compute] Implement stdev aggregate kernel

Posted by GitBox <gi...@apache.org>.
kszucs edited a comment on pull request #8269:
URL: https://github.com/apache/arrow/pull/8269#issuecomment-698948177


   > missing license header?
   
   Yes, https://github.com/apache/arrow/pull/8273 should fix that.
   
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou commented on a change in pull request #8269: ARROW-10070: [C++][Compute] Implement var and std aggregate kernel

Posted by GitBox <gi...@apache.org>.
pitrou commented on a change in pull request #8269:
URL: https://github.com/apache/arrow/pull/8269#discussion_r496673112



##########
File path: docs/source/cpp/compute.rst
##########
@@ -140,8 +140,12 @@ Aggregations
 +--------------------------+------------+--------------------+-----------------------+--------------------------------------------+
 | mode                     | Unary      | Numeric            | Scalar Struct  (2)    |                                            |
 +--------------------------+------------+--------------------+-----------------------+--------------------------------------------+
+| std                      | Unary      | Numeric            | Scalar Float64        | :struct:`VarStdOptions`                    |
++--------------------------+------------+--------------------+-----------------------+--------------------------------------------+
 | sum                      | Unary      | Numeric            | Scalar Numeric (3)    |                                            |
 +--------------------------+------------+--------------------+-----------------------+--------------------------------------------+
+| var                      | Unary      | Numeric            | Scalar Float64        | :struct:`VarStdOptions`                    |

Review comment:
       Would rather "stddev" and "variance" respectively. "std" and "var" can easily be misunderstood, IMHO.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorisvandenbossche commented on pull request #8269: ARROW-10070: [C++][Compute] Implement stdev aggregate kernel

Posted by GitBox <gi...@apache.org>.
jorisvandenbossche commented on pull request #8269:
URL: https://github.com/apache/arrow/pull/8269#issuecomment-699450469


   We might want to have an option to specify the denominator? (whether it is `n` or `n - 1`, to compute population vs sample standard deviation) As some examples, [numpy](https://numpy.org/doc/stable/reference/generated/numpy.std.html) has a `ddof` keyword, [postgres](https://www.postgresql.org/docs/9.1/functions-aggregate.html) or [clickhouse](https://clickhouse.tech/docs/en/sql-reference/aggregate-functions/reference/stddevpop/) have separate functions for both, [julia](https://docs.julialang.org/en/v1/stdlib/Statistics/#Statistics.std) has a `corrected` keyword.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] cyb70289 edited a comment on pull request #8269: ARROW-10070: [C++][Compute] Implement stdev aggregate kernel

Posted by GitBox <gi...@apache.org>.
cyb70289 edited a comment on pull request #8269:
URL: https://github.com/apache/arrow/pull/8269#issuecomment-698754806


   Dev / Lint ci failure looks not related.
   ```
   INFO:archery:Running Docker linter
   apache-rat license violation: go/arrow/flight/Flight_grpc.pb.go
   apache-rat license violation: go/arrow/flight/example_flight_server_test.go
   Error: `docker-compose --file /home/runner/work/arrow/arrow/docker-compose.yml run --rm ubuntu-lint` exited with a non-zero exit code 1, see the process log above.
   ```
   missing license header?
   Will it related to pr https://github.com/apache/arrow/pull/8175? @zeroshade  @wesm 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorisvandenbossche commented on a change in pull request #8269: ARROW-10070: [C++][Compute] Implement var and std aggregate kernel

Posted by GitBox <gi...@apache.org>.
jorisvandenbossche commented on a change in pull request #8269:
URL: https://github.com/apache/arrow/pull/8269#discussion_r497282421



##########
File path: cpp/src/arrow/compute/api_aggregate.h
##########
@@ -76,6 +76,18 @@ struct ARROW_EXPORT MinMaxOptions : public FunctionOptions {
   enum Mode null_handling = SKIP;
 };
 
+/// \brief Control Delta Degrees of Freedom (ddof) of Variance and Stddev kernel
+///
+/// The divisor used in calculations is N - ddof, where N is the number of elements.
+/// By default, ddof is zero, and population variance or stddev is returned.
+struct ARROW_EXPORT VarStdOptions : public FunctionOptions {
+  explicit VarStdOptions(int ddof = 0) : ddof(ddof) {}

Review comment:
       I agree there might be other options in the future. But with the renaming of the functions, maybe call it `VarianceOptions` instead?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] cyb70289 commented on pull request #8269: ARROW-10070: [C++][Compute] Implement var and std aggregate kernel

Posted by GitBox <gi...@apache.org>.
cyb70289 commented on pull request #8269:
URL: https://github.com/apache/arrow/pull/8269#issuecomment-700432621


   Added variance kernel `var`. Renamed standard deviation kernel to `std`.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou commented on a change in pull request #8269: ARROW-10070: [C++][Compute] Implement var and std aggregate kernel

Posted by GitBox <gi...@apache.org>.
pitrou commented on a change in pull request #8269:
URL: https://github.com/apache/arrow/pull/8269#discussion_r496686620



##########
File path: cpp/src/arrow/compute/kernels/aggregate_test.cc
##########
@@ -914,5 +914,127 @@ TEST_F(TestInt32ModeKernel, LargeValueRange) {
   CheckModeWithRange<ArrowType>(-10000000, 10000000);
 }
 
+//
+// Var/Std
+//
+
+template <typename ArrowType>
+class TestPrimitiveVarStdKernel : public ::testing::Test {
+ public:
+  using Traits = TypeTraits<ArrowType>;
+  using ScalarType = typename TypeTraits<DoubleType>::ScalarType;
+
+  void AssertVarStdIs(const Datum& array, const VarStdOptions& options,
+                      double expected_var, double diff = 0) {
+    ASSERT_OK_AND_ASSIGN(Datum out_var, Var(array, options));
+    ASSERT_OK_AND_ASSIGN(Datum out_std, Std(array, options));
+    auto var = checked_cast<const ScalarType*>(out_var.scalar().get());
+    auto std = checked_cast<const ScalarType*>(out_std.scalar().get());
+    ASSERT_TRUE(var->is_valid && std->is_valid);
+    ASSERT_DOUBLE_EQ(std->value * std->value, var->value);
+    if (diff == 0) {
+      ASSERT_DOUBLE_EQ(var->value, expected_var);  // < 4ULP
+    } else {
+      ASSERT_NEAR(var->value, expected_var, diff);
+    }
+  }
+
+  void AssertVarStdIs(const std::string& json, const VarStdOptions& options,
+                      double expected_var) {
+    auto array = ArrayFromJSON(type_singleton(), json);
+    AssertVarStdIs(array, options, expected_var);
+  }
+
+  void AssertVarStdIs(const std::vector<std::string>& json, const VarStdOptions& options,
+                      double expected_var) {
+    auto chunked = ChunkedArrayFromJSON(type_singleton(), json);
+    AssertVarStdIs(chunked, options, expected_var);
+  }
+
+  void AssertVarStdIsInvalid(const Datum& array, const VarStdOptions& options) {
+    ASSERT_OK_AND_ASSIGN(Datum out_var, Var(array, options));
+    ASSERT_OK_AND_ASSIGN(Datum out_std, Std(array, options));
+    auto var = checked_cast<const ScalarType*>(out_var.scalar().get());
+    auto std = checked_cast<const ScalarType*>(out_std.scalar().get());
+    ASSERT_FALSE(var->is_valid || std->is_valid);
+  }
+
+  void AssertVarStdIsInvalid(const std::string& json, const VarStdOptions& options) {
+    auto array = ArrayFromJSON(type_singleton(), json);
+    AssertVarStdIsInvalid(array, options);
+  }
+
+  std::shared_ptr<DataType> type_singleton() { return Traits::type_singleton(); }
+};
+
+template <typename ArrowType>
+class TestNumericVarStdKernel : public TestPrimitiveVarStdKernel<ArrowType> {};
+
+// Reference value from numpy.var
+TYPED_TEST_SUITE(TestNumericVarStdKernel, NumericArrowTypes);
+TYPED_TEST(TestNumericVarStdKernel, Basics) {
+  VarStdOptions options;  // ddof = 0, population var/std
+
+  this->AssertVarStdIs("[100]", options, 0);
+  this->AssertVarStdIs("[1, 2, 3]", options, 0.6666666666666666);
+  this->AssertVarStdIs("[null, 1, 2, null, 3]", options, 0.6666666666666666);
+
+  this->AssertVarStdIs({"[]", "[1]", "[2]", "[null]", "[3]"}, options,
+                       0.6666666666666666);
+  this->AssertVarStdIs({"[1, 2, 3]", "[4, 5, 6]", "[7, 8]"}, options, 5.25);
+
+  this->AssertVarStdIsInvalid("[null, null, null]", options);
+  this->AssertVarStdIsInvalid("[]", options);
+
+  options.ddof = 1;  // sample var/std
+
+  this->AssertVarStdIs("[1, 2]", options, 0.5);
+  this->AssertVarStdIs({"[1, 2, 3]", "[4, 5, 6]", "[7, 8]"}, options, 6.0);

Review comment:
       Same here.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] cyb70289 commented on pull request #8269: ARROW-10070: [C++][Compute] Implement stdev aggregate kernel

Posted by GitBox <gi...@apache.org>.
cyb70289 commented on pull request #8269:
URL: https://github.com/apache/arrow/pull/8269#issuecomment-698754806


   Dev / Lint ci failure looks not related.
   ```
   INFO:archery:Running Docker linter
   apache-rat license violation: go/arrow/flight/Flight_grpc.pb.go
   apache-rat license violation: go/arrow/flight/example_flight_server_test.go
   Error: `docker-compose --file /home/runner/work/arrow/arrow/docker-compose.yml run --rm ubuntu-lint` exited with a non-zero exit code 1, see the process log above.
   ```
   missing license header?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] cyb70289 commented on a change in pull request #8269: ARROW-10070: [C++][Compute] Implement var and std aggregate kernel

Posted by GitBox <gi...@apache.org>.
cyb70289 commented on a change in pull request #8269:
URL: https://github.com/apache/arrow/pull/8269#discussion_r496780078



##########
File path: cpp/src/arrow/compute/kernels/aggregate_var_std.cc
##########
@@ -0,0 +1,200 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include "arrow/compute/kernels/aggregate_basic_internal.h"
+
+namespace arrow {
+namespace compute {
+namespace aggregate {
+
+namespace {
+
+// Based on https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance
+template <typename ArrowType>
+struct VarStdState {
+  using ArrayType = typename TypeTraits<ArrowType>::ArrayType;
+  using c_type = typename ArrowType::c_type;
+  using ThisType = VarStdState<ArrowType>;
+
+  // Calculate variance of one chunk with `two pass algorithm`
+  // Always use `double` to calculate variance for any array type
+  void Consume(const ArrayType& array) {
+    int64_t count = array.length() - array.null_count();
+    if (count == 0) {
+      return;
+    }
+
+    double sum = 0;
+    VisitArrayDataInline<ArrowType>(
+        *array.data(), [&sum](c_type value) { sum += static_cast<double>(value); },
+        []() {});
+
+    double mean = sum / count, m2 = 0;
+    VisitArrayDataInline<ArrowType>(
+        *array.data(),
+        [mean, &m2](c_type value) {
+          double v = static_cast<double>(value);
+          m2 += (v - mean) * (v - mean);
+        },
+        []() {});
+
+    this->count = count;
+    this->sum = sum;
+    this->m2 = m2;
+  }
+
+  // Combine variance from two chunks
+  void MergeFrom(const ThisType& state) {
+    if (state.count == 0) {
+      return;
+    }
+    if (this->count == 0) {
+      this->count = state.count;
+      this->sum = state.sum;
+      this->m2 = state.m2;
+      return;
+    }
+    double delta = this->sum / this->count - state.sum / state.count;
+    this->m2 += state.m2 +
+                delta * delta * this->count * state.count / (this->count + state.count);

Review comment:
       The comment `// Combine variance from two chunks` is misleading.
   Actually, this code is combining `m2` (which equals `sum((X-mean)^2)`) from two chunks, not the variance. See https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Parallel_algorithm
   Will refine comments.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou commented on a change in pull request #8269: ARROW-10070: [C++][Compute] Implement var and std aggregate kernel

Posted by GitBox <gi...@apache.org>.
pitrou commented on a change in pull request #8269:
URL: https://github.com/apache/arrow/pull/8269#discussion_r496686115



##########
File path: cpp/src/arrow/compute/kernels/aggregate_test.cc
##########
@@ -914,5 +914,127 @@ TEST_F(TestInt32ModeKernel, LargeValueRange) {
   CheckModeWithRange<ArrowType>(-10000000, 10000000);
 }
 
+//
+// Var/Std
+//
+
+template <typename ArrowType>
+class TestPrimitiveVarStdKernel : public ::testing::Test {
+ public:
+  using Traits = TypeTraits<ArrowType>;
+  using ScalarType = typename TypeTraits<DoubleType>::ScalarType;
+
+  void AssertVarStdIs(const Datum& array, const VarStdOptions& options,
+                      double expected_var, double diff = 0) {
+    ASSERT_OK_AND_ASSIGN(Datum out_var, Var(array, options));
+    ASSERT_OK_AND_ASSIGN(Datum out_std, Std(array, options));
+    auto var = checked_cast<const ScalarType*>(out_var.scalar().get());
+    auto std = checked_cast<const ScalarType*>(out_std.scalar().get());
+    ASSERT_TRUE(var->is_valid && std->is_valid);
+    ASSERT_DOUBLE_EQ(std->value * std->value, var->value);
+    if (diff == 0) {
+      ASSERT_DOUBLE_EQ(var->value, expected_var);  // < 4ULP
+    } else {
+      ASSERT_NEAR(var->value, expected_var, diff);
+    }
+  }
+
+  void AssertVarStdIs(const std::string& json, const VarStdOptions& options,
+                      double expected_var) {
+    auto array = ArrayFromJSON(type_singleton(), json);
+    AssertVarStdIs(array, options, expected_var);
+  }
+
+  void AssertVarStdIs(const std::vector<std::string>& json, const VarStdOptions& options,
+                      double expected_var) {
+    auto chunked = ChunkedArrayFromJSON(type_singleton(), json);
+    AssertVarStdIs(chunked, options, expected_var);
+  }
+
+  void AssertVarStdIsInvalid(const Datum& array, const VarStdOptions& options) {
+    ASSERT_OK_AND_ASSIGN(Datum out_var, Var(array, options));
+    ASSERT_OK_AND_ASSIGN(Datum out_std, Std(array, options));
+    auto var = checked_cast<const ScalarType*>(out_var.scalar().get());
+    auto std = checked_cast<const ScalarType*>(out_std.scalar().get());
+    ASSERT_FALSE(var->is_valid || std->is_valid);
+  }
+
+  void AssertVarStdIsInvalid(const std::string& json, const VarStdOptions& options) {
+    auto array = ArrayFromJSON(type_singleton(), json);
+    AssertVarStdIsInvalid(array, options);
+  }
+
+  std::shared_ptr<DataType> type_singleton() { return Traits::type_singleton(); }
+};
+
+template <typename ArrowType>
+class TestNumericVarStdKernel : public TestPrimitiveVarStdKernel<ArrowType> {};
+
+// Reference value from numpy.var
+TYPED_TEST_SUITE(TestNumericVarStdKernel, NumericArrowTypes);
+TYPED_TEST(TestNumericVarStdKernel, Basics) {
+  VarStdOptions options;  // ddof = 0, population var/std
+
+  this->AssertVarStdIs("[100]", options, 0);
+  this->AssertVarStdIs("[1, 2, 3]", options, 0.6666666666666666);
+  this->AssertVarStdIs("[null, 1, 2, null, 3]", options, 0.6666666666666666);
+
+  this->AssertVarStdIs({"[]", "[1]", "[2]", "[null]", "[3]"}, options,
+                       0.6666666666666666);
+  this->AssertVarStdIs({"[1, 2, 3]", "[4, 5, 6]", "[7, 8]"}, options, 5.25);

Review comment:
       Can we also test with chunks of very different sizes? For example `[1, 2, 3, 4, 5, 6, 7]` and `[8]`.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou commented on pull request #8269: ARROW-10070: [C++][Compute] Implement var and std aggregate kernel

Posted by GitBox <gi...@apache.org>.
pitrou commented on pull request #8269:
URL: https://github.com/apache/arrow/pull/8269#issuecomment-701355724


   Since we're not doing anything special, it certainly will. I think that can be tackled in a separate JIRA, when adding a nan handling option perhaps.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] cyb70289 commented on a change in pull request #8269: ARROW-10070: [C++][Compute] Implement var and std aggregate kernel

Posted by GitBox <gi...@apache.org>.
cyb70289 commented on a change in pull request #8269:
URL: https://github.com/apache/arrow/pull/8269#discussion_r497276818



##########
File path: cpp/src/arrow/compute/api_aggregate.h
##########
@@ -76,6 +76,18 @@ struct ARROW_EXPORT MinMaxOptions : public FunctionOptions {
   enum Mode null_handling = SKIP;
 };
 
+/// \brief Control Delta Degrees of Freedom (ddof) of Variance and Stddev kernel
+///
+/// The divisor used in calculations is N - ddof, where N is the number of elements.
+/// By default, ddof is zero, and population variance or stddev is returned.
+struct ARROW_EXPORT VarStdOptions : public FunctionOptions {
+  explicit VarStdOptions(int ddof = 0) : ddof(ddof) {}

Review comment:
       I'm not sure if we will add options later, e.g., to ignore NaN




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou commented on a change in pull request #8269: ARROW-10070: [C++][Compute] Implement var and std aggregate kernel

Posted by GitBox <gi...@apache.org>.
pitrou commented on a change in pull request #8269:
URL: https://github.com/apache/arrow/pull/8269#discussion_r497409316



##########
File path: cpp/src/arrow/compute/kernels/aggregate_var_std.cc
##########
@@ -0,0 +1,200 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include "arrow/compute/kernels/aggregate_basic_internal.h"
+
+namespace arrow {
+namespace compute {
+namespace aggregate {
+
+namespace {
+
+// Based on https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance
+template <typename ArrowType>
+struct VarStdState {
+  using ArrayType = typename TypeTraits<ArrowType>::ArrayType;
+  using c_type = typename ArrowType::c_type;
+  using ThisType = VarStdState<ArrowType>;
+
+  // Calculate variance of one chunk with `two pass algorithm`
+  // Always use `double` to calculate variance for any array type
+  void Consume(const ArrayType& array) {
+    int64_t count = array.length() - array.null_count();
+    if (count == 0) {
+      return;
+    }
+
+    double sum = 0;
+    VisitArrayDataInline<ArrowType>(
+        *array.data(), [&sum](c_type value) { sum += static_cast<double>(value); },
+        []() {});
+
+    double mean = sum / count, m2 = 0;
+    VisitArrayDataInline<ArrowType>(
+        *array.data(),
+        [mean, &m2](c_type value) {
+          double v = static_cast<double>(value);
+          m2 += (v - mean) * (v - mean);
+        },
+        []() {});
+
+    this->count = count;
+    this->sum = sum;
+    this->m2 = m2;
+  }
+
+  // Combine variance from two chunks
+  void MergeFrom(const ThisType& state) {
+    if (state.count == 0) {
+      return;
+    }
+    if (this->count == 0) {
+      this->count = state.count;
+      this->sum = state.sum;
+      this->m2 = state.m2;
+      return;
+    }
+    double delta = this->sum / this->count - state.sum / state.count;
+    this->m2 += state.m2 +
+                delta * delta * this->count * state.count / (this->count + state.count);

Review comment:
       Thank you!




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] cyb70289 commented on a change in pull request #8269: ARROW-10070: [C++][Compute] Implement var and std aggregate kernel

Posted by GitBox <gi...@apache.org>.
cyb70289 commented on a change in pull request #8269:
URL: https://github.com/apache/arrow/pull/8269#discussion_r497254587



##########
File path: cpp/src/arrow/compute/kernels/aggregate_test.cc
##########
@@ -914,5 +914,127 @@ TEST_F(TestInt32ModeKernel, LargeValueRange) {
   CheckModeWithRange<ArrowType>(-10000000, 10000000);
 }
 
+//
+// Var/Std
+//
+
+template <typename ArrowType>
+class TestPrimitiveVarStdKernel : public ::testing::Test {
+ public:
+  using Traits = TypeTraits<ArrowType>;
+  using ScalarType = typename TypeTraits<DoubleType>::ScalarType;
+
+  void AssertVarStdIs(const Datum& array, const VarStdOptions& options,
+                      double expected_var, double diff = 0) {
+    ASSERT_OK_AND_ASSIGN(Datum out_var, Var(array, options));
+    ASSERT_OK_AND_ASSIGN(Datum out_std, Std(array, options));
+    auto var = checked_cast<const ScalarType*>(out_var.scalar().get());
+    auto std = checked_cast<const ScalarType*>(out_std.scalar().get());
+    ASSERT_TRUE(var->is_valid && std->is_valid);
+    ASSERT_DOUBLE_EQ(std->value * std->value, var->value);
+    if (diff == 0) {
+      ASSERT_DOUBLE_EQ(var->value, expected_var);  // < 4ULP
+    } else {
+      ASSERT_NEAR(var->value, expected_var, diff);
+    }
+  }
+
+  void AssertVarStdIs(const std::string& json, const VarStdOptions& options,
+                      double expected_var) {
+    auto array = ArrayFromJSON(type_singleton(), json);
+    AssertVarStdIs(array, options, expected_var);
+  }
+
+  void AssertVarStdIs(const std::vector<std::string>& json, const VarStdOptions& options,
+                      double expected_var) {
+    auto chunked = ChunkedArrayFromJSON(type_singleton(), json);
+    AssertVarStdIs(chunked, options, expected_var);
+  }
+
+  void AssertVarStdIsInvalid(const Datum& array, const VarStdOptions& options) {
+    ASSERT_OK_AND_ASSIGN(Datum out_var, Var(array, options));
+    ASSERT_OK_AND_ASSIGN(Datum out_std, Std(array, options));
+    auto var = checked_cast<const ScalarType*>(out_var.scalar().get());
+    auto std = checked_cast<const ScalarType*>(out_std.scalar().get());
+    ASSERT_FALSE(var->is_valid || std->is_valid);
+  }
+
+  void AssertVarStdIsInvalid(const std::string& json, const VarStdOptions& options) {
+    auto array = ArrayFromJSON(type_singleton(), json);
+    AssertVarStdIsInvalid(array, options);
+  }
+
+  std::shared_ptr<DataType> type_singleton() { return Traits::type_singleton(); }
+};
+
+template <typename ArrowType>
+class TestNumericVarStdKernel : public TestPrimitiveVarStdKernel<ArrowType> {};
+
+// Reference value from numpy.var
+TYPED_TEST_SUITE(TestNumericVarStdKernel, NumericArrowTypes);
+TYPED_TEST(TestNumericVarStdKernel, Basics) {
+  VarStdOptions options;  // ddof = 0, population var/std
+
+  this->AssertVarStdIs("[100]", options, 0);
+  this->AssertVarStdIs("[1, 2, 3]", options, 0.6666666666666666);
+  this->AssertVarStdIs("[null, 1, 2, null, 3]", options, 0.6666666666666666);
+
+  this->AssertVarStdIs({"[]", "[1]", "[2]", "[null]", "[3]"}, options,
+                       0.6666666666666666);
+  this->AssertVarStdIs({"[1, 2, 3]", "[4, 5, 6]", "[7, 8]"}, options, 5.25);

Review comment:
       done

##########
File path: cpp/src/arrow/compute/kernels/aggregate_test.cc
##########
@@ -914,5 +914,127 @@ TEST_F(TestInt32ModeKernel, LargeValueRange) {
   CheckModeWithRange<ArrowType>(-10000000, 10000000);
 }
 
+//
+// Var/Std
+//
+
+template <typename ArrowType>
+class TestPrimitiveVarStdKernel : public ::testing::Test {
+ public:
+  using Traits = TypeTraits<ArrowType>;
+  using ScalarType = typename TypeTraits<DoubleType>::ScalarType;
+
+  void AssertVarStdIs(const Datum& array, const VarStdOptions& options,
+                      double expected_var, double diff = 0) {
+    ASSERT_OK_AND_ASSIGN(Datum out_var, Var(array, options));
+    ASSERT_OK_AND_ASSIGN(Datum out_std, Std(array, options));
+    auto var = checked_cast<const ScalarType*>(out_var.scalar().get());
+    auto std = checked_cast<const ScalarType*>(out_std.scalar().get());
+    ASSERT_TRUE(var->is_valid && std->is_valid);
+    ASSERT_DOUBLE_EQ(std->value * std->value, var->value);
+    if (diff == 0) {
+      ASSERT_DOUBLE_EQ(var->value, expected_var);  // < 4ULP
+    } else {
+      ASSERT_NEAR(var->value, expected_var, diff);
+    }
+  }
+
+  void AssertVarStdIs(const std::string& json, const VarStdOptions& options,
+                      double expected_var) {
+    auto array = ArrayFromJSON(type_singleton(), json);
+    AssertVarStdIs(array, options, expected_var);
+  }
+
+  void AssertVarStdIs(const std::vector<std::string>& json, const VarStdOptions& options,
+                      double expected_var) {
+    auto chunked = ChunkedArrayFromJSON(type_singleton(), json);
+    AssertVarStdIs(chunked, options, expected_var);
+  }
+
+  void AssertVarStdIsInvalid(const Datum& array, const VarStdOptions& options) {
+    ASSERT_OK_AND_ASSIGN(Datum out_var, Var(array, options));
+    ASSERT_OK_AND_ASSIGN(Datum out_std, Std(array, options));
+    auto var = checked_cast<const ScalarType*>(out_var.scalar().get());
+    auto std = checked_cast<const ScalarType*>(out_std.scalar().get());
+    ASSERT_FALSE(var->is_valid || std->is_valid);
+  }
+
+  void AssertVarStdIsInvalid(const std::string& json, const VarStdOptions& options) {
+    auto array = ArrayFromJSON(type_singleton(), json);
+    AssertVarStdIsInvalid(array, options);
+  }
+
+  std::shared_ptr<DataType> type_singleton() { return Traits::type_singleton(); }
+};
+
+template <typename ArrowType>
+class TestNumericVarStdKernel : public TestPrimitiveVarStdKernel<ArrowType> {};
+
+// Reference value from numpy.var
+TYPED_TEST_SUITE(TestNumericVarStdKernel, NumericArrowTypes);
+TYPED_TEST(TestNumericVarStdKernel, Basics) {
+  VarStdOptions options;  // ddof = 0, population var/std
+
+  this->AssertVarStdIs("[100]", options, 0);
+  this->AssertVarStdIs("[1, 2, 3]", options, 0.6666666666666666);
+  this->AssertVarStdIs("[null, 1, 2, null, 3]", options, 0.6666666666666666);
+
+  this->AssertVarStdIs({"[]", "[1]", "[2]", "[null]", "[3]"}, options,
+                       0.6666666666666666);
+  this->AssertVarStdIs({"[1, 2, 3]", "[4, 5, 6]", "[7, 8]"}, options, 5.25);
+
+  this->AssertVarStdIsInvalid("[null, null, null]", options);
+  this->AssertVarStdIsInvalid("[]", options);
+
+  options.ddof = 1;  // sample var/std
+
+  this->AssertVarStdIs("[1, 2]", options, 0.5);
+  this->AssertVarStdIs({"[1, 2, 3]", "[4, 5, 6]", "[7, 8]"}, options, 6.0);

Review comment:
       done




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou commented on a change in pull request #8269: ARROW-10070: [C++][Compute] Implement var and std aggregate kernel

Posted by GitBox <gi...@apache.org>.
pitrou commented on a change in pull request #8269:
URL: https://github.com/apache/arrow/pull/8269#discussion_r496685479



##########
File path: cpp/src/arrow/compute/kernels/aggregate_test.cc
##########
@@ -914,5 +914,127 @@ TEST_F(TestInt32ModeKernel, LargeValueRange) {
   CheckModeWithRange<ArrowType>(-10000000, 10000000);
 }
 
+//
+// Var/Std
+//
+
+template <typename ArrowType>
+class TestPrimitiveVarStdKernel : public ::testing::Test {
+ public:
+  using Traits = TypeTraits<ArrowType>;
+  using ScalarType = typename TypeTraits<DoubleType>::ScalarType;
+
+  void AssertVarStdIs(const Datum& array, const VarStdOptions& options,
+                      double expected_var, double diff = 0) {
+    ASSERT_OK_AND_ASSIGN(Datum out_var, Var(array, options));
+    ASSERT_OK_AND_ASSIGN(Datum out_std, Std(array, options));
+    auto var = checked_cast<const ScalarType*>(out_var.scalar().get());
+    auto std = checked_cast<const ScalarType*>(out_std.scalar().get());
+    ASSERT_TRUE(var->is_valid && std->is_valid);
+    ASSERT_DOUBLE_EQ(std->value * std->value, var->value);
+    if (diff == 0) {
+      ASSERT_DOUBLE_EQ(var->value, expected_var);  // < 4ULP
+    } else {
+      ASSERT_NEAR(var->value, expected_var, diff);
+    }
+  }
+
+  void AssertVarStdIs(const std::string& json, const VarStdOptions& options,
+                      double expected_var) {
+    auto array = ArrayFromJSON(type_singleton(), json);
+    AssertVarStdIs(array, options, expected_var);
+  }
+
+  void AssertVarStdIs(const std::vector<std::string>& json, const VarStdOptions& options,
+                      double expected_var) {
+    auto chunked = ChunkedArrayFromJSON(type_singleton(), json);
+    AssertVarStdIs(chunked, options, expected_var);
+  }
+
+  void AssertVarStdIsInvalid(const Datum& array, const VarStdOptions& options) {
+    ASSERT_OK_AND_ASSIGN(Datum out_var, Var(array, options));
+    ASSERT_OK_AND_ASSIGN(Datum out_std, Std(array, options));
+    auto var = checked_cast<const ScalarType*>(out_var.scalar().get());
+    auto std = checked_cast<const ScalarType*>(out_std.scalar().get());
+    ASSERT_FALSE(var->is_valid || std->is_valid);
+  }
+
+  void AssertVarStdIsInvalid(const std::string& json, const VarStdOptions& options) {
+    auto array = ArrayFromJSON(type_singleton(), json);
+    AssertVarStdIsInvalid(array, options);
+  }
+
+  std::shared_ptr<DataType> type_singleton() { return Traits::type_singleton(); }
+};
+
+template <typename ArrowType>
+class TestNumericVarStdKernel : public TestPrimitiveVarStdKernel<ArrowType> {};
+
+// Reference value from numpy.var
+TYPED_TEST_SUITE(TestNumericVarStdKernel, NumericArrowTypes);
+TYPED_TEST(TestNumericVarStdKernel, Basics) {
+  VarStdOptions options;  // ddof = 0, population var/std
+
+  this->AssertVarStdIs("[100]", options, 0);
+  this->AssertVarStdIs("[1, 2, 3]", options, 0.6666666666666666);
+  this->AssertVarStdIs("[null, 1, 2, null, 3]", options, 0.6666666666666666);
+
+  this->AssertVarStdIs({"[]", "[1]", "[2]", "[null]", "[3]"}, options,
+                       0.6666666666666666);
+  this->AssertVarStdIs({"[1, 2, 3]", "[4, 5, 6]", "[7, 8]"}, options, 5.25);
+
+  this->AssertVarStdIsInvalid("[null, null, null]", options);
+  this->AssertVarStdIsInvalid("[]", options);
+
+  options.ddof = 1;  // sample var/std
+
+  this->AssertVarStdIs("[1, 2]", options, 0.5);
+  this->AssertVarStdIs({"[1, 2, 3]", "[4, 5, 6]", "[7, 8]"}, options, 6.0);
+
+  this->AssertVarStdIsInvalid("[100]", options);
+  this->AssertVarStdIsInvalid("[100, null, null]", options);
+}
+
+class TestVarStdKernelStability : public TestPrimitiveVarStdKernel<DoubleType> {};
+
+// Test numerical stability
+TEST_F(TestVarStdKernelStability, Basics) {
+  VarStdOptions options{1};  // ddof = 1
+  this->AssertVarStdIs("[100000004, 100000007, 100000013, 100000016]", options, 30.0);
+  this->AssertVarStdIs("[1000000004, 1000000007, 1000000013, 1000000016]", options, 30.0);
+}
+
+// Calculate reference variance with welford online algorithm
+double WelfordVar(const Array& array) {
+  const auto& array_numeric = reinterpret_cast<const DoubleArray&>(array);
+  const auto values = array_numeric.raw_values();
+  internal::BitmapReader reader(array.null_bitmap_data(), array.offset(), array.length());
+  double count = 0, mean = 0, m2 = 0;
+  for (int64_t i = 0; i < array.length(); ++i) {
+    if (reader.IsSet()) {
+      ++count;
+      double delta = values[i] - mean;
+      mean += delta / count;
+      double delta2 = values[i] - mean;
+      m2 += delta * delta2;
+    }
+    reader.Next();
+  }
+  return m2 / count;
+}
+
+class TestVarStdKernelRandom : public TestPrimitiveVarStdKernel<DoubleType> {};
+
+TEST_F(TestVarStdKernelRandom, Basics) {
+  auto rand = random::RandomArrayGenerator(0x5487656);
+  auto array = rand.Numeric<DoubleType>(40000, -10000.0, 100000.0, 0.1);

Review comment:
       Ironically, the fact that this is random identically-distributed data means that the slices will have similar means and variances.  Perhaps making the array smaller would make the test more significant.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] cyb70289 commented on a change in pull request #8269: ARROW-10070: [C++][Compute] Implement var and std aggregate kernel

Posted by GitBox <gi...@apache.org>.
cyb70289 commented on a change in pull request #8269:
URL: https://github.com/apache/arrow/pull/8269#discussion_r497254649



##########
File path: cpp/src/arrow/compute/kernels/aggregate_test.cc
##########
@@ -914,5 +914,127 @@ TEST_F(TestInt32ModeKernel, LargeValueRange) {
   CheckModeWithRange<ArrowType>(-10000000, 10000000);
 }
 
+//
+// Var/Std
+//
+
+template <typename ArrowType>
+class TestPrimitiveVarStdKernel : public ::testing::Test {
+ public:
+  using Traits = TypeTraits<ArrowType>;
+  using ScalarType = typename TypeTraits<DoubleType>::ScalarType;
+
+  void AssertVarStdIs(const Datum& array, const VarStdOptions& options,
+                      double expected_var, double diff = 0) {
+    ASSERT_OK_AND_ASSIGN(Datum out_var, Var(array, options));
+    ASSERT_OK_AND_ASSIGN(Datum out_std, Std(array, options));
+    auto var = checked_cast<const ScalarType*>(out_var.scalar().get());
+    auto std = checked_cast<const ScalarType*>(out_std.scalar().get());
+    ASSERT_TRUE(var->is_valid && std->is_valid);
+    ASSERT_DOUBLE_EQ(std->value * std->value, var->value);
+    if (diff == 0) {
+      ASSERT_DOUBLE_EQ(var->value, expected_var);  // < 4ULP
+    } else {
+      ASSERT_NEAR(var->value, expected_var, diff);
+    }
+  }
+
+  void AssertVarStdIs(const std::string& json, const VarStdOptions& options,
+                      double expected_var) {
+    auto array = ArrayFromJSON(type_singleton(), json);
+    AssertVarStdIs(array, options, expected_var);
+  }
+
+  void AssertVarStdIs(const std::vector<std::string>& json, const VarStdOptions& options,
+                      double expected_var) {
+    auto chunked = ChunkedArrayFromJSON(type_singleton(), json);
+    AssertVarStdIs(chunked, options, expected_var);
+  }
+
+  void AssertVarStdIsInvalid(const Datum& array, const VarStdOptions& options) {
+    ASSERT_OK_AND_ASSIGN(Datum out_var, Var(array, options));
+    ASSERT_OK_AND_ASSIGN(Datum out_std, Std(array, options));
+    auto var = checked_cast<const ScalarType*>(out_var.scalar().get());
+    auto std = checked_cast<const ScalarType*>(out_std.scalar().get());
+    ASSERT_FALSE(var->is_valid || std->is_valid);
+  }
+
+  void AssertVarStdIsInvalid(const std::string& json, const VarStdOptions& options) {
+    auto array = ArrayFromJSON(type_singleton(), json);
+    AssertVarStdIsInvalid(array, options);
+  }
+
+  std::shared_ptr<DataType> type_singleton() { return Traits::type_singleton(); }
+};
+
+template <typename ArrowType>
+class TestNumericVarStdKernel : public TestPrimitiveVarStdKernel<ArrowType> {};
+
+// Reference value from numpy.var
+TYPED_TEST_SUITE(TestNumericVarStdKernel, NumericArrowTypes);
+TYPED_TEST(TestNumericVarStdKernel, Basics) {
+  VarStdOptions options;  // ddof = 0, population var/std
+
+  this->AssertVarStdIs("[100]", options, 0);
+  this->AssertVarStdIs("[1, 2, 3]", options, 0.6666666666666666);
+  this->AssertVarStdIs("[null, 1, 2, null, 3]", options, 0.6666666666666666);
+
+  this->AssertVarStdIs({"[]", "[1]", "[2]", "[null]", "[3]"}, options,
+                       0.6666666666666666);
+  this->AssertVarStdIs({"[1, 2, 3]", "[4, 5, 6]", "[7, 8]"}, options, 5.25);
+
+  this->AssertVarStdIsInvalid("[null, null, null]", options);
+  this->AssertVarStdIsInvalid("[]", options);
+
+  options.ddof = 1;  // sample var/std
+
+  this->AssertVarStdIs("[1, 2]", options, 0.5);
+  this->AssertVarStdIs({"[1, 2, 3]", "[4, 5, 6]", "[7, 8]"}, options, 6.0);
+
+  this->AssertVarStdIsInvalid("[100]", options);
+  this->AssertVarStdIsInvalid("[100, null, null]", options);
+}
+
+class TestVarStdKernelStability : public TestPrimitiveVarStdKernel<DoubleType> {};
+
+// Test numerical stability
+TEST_F(TestVarStdKernelStability, Basics) {
+  VarStdOptions options{1};  // ddof = 1
+  this->AssertVarStdIs("[100000004, 100000007, 100000013, 100000016]", options, 30.0);
+  this->AssertVarStdIs("[1000000004, 1000000007, 1000000013, 1000000016]", options, 30.0);
+}
+
+// Calculate reference variance with welford online algorithm
+double WelfordVar(const Array& array) {
+  const auto& array_numeric = reinterpret_cast<const DoubleArray&>(array);
+  const auto values = array_numeric.raw_values();
+  internal::BitmapReader reader(array.null_bitmap_data(), array.offset(), array.length());
+  double count = 0, mean = 0, m2 = 0;
+  for (int64_t i = 0; i < array.length(); ++i) {
+    if (reader.IsSet()) {
+      ++count;
+      double delta = values[i] - mean;
+      mean += delta / count;
+      double delta2 = values[i] - mean;
+      m2 += delta * delta2;
+    }
+    reader.Next();
+  }
+  return m2 / count;
+}
+
+class TestVarStdKernelRandom : public TestPrimitiveVarStdKernel<DoubleType> {};
+
+TEST_F(TestVarStdKernelRandom, Basics) {
+  auto rand = random::RandomArrayGenerator(0x5487656);
+  auto array = rand.Numeric<DoubleType>(40000, -10000.0, 100000.0, 0.1);
+  auto chunked = *ChunkedArray::Make(
+      {array->Slice(0, 1000), array->Slice(1000, 9000), array->Slice(10000, 30000)});
+  double expected_var = WelfordVar(*array);
+
+  VarStdOptions options;
+  this->AssertVarStdIs(chunked, options, expected_var, 0.0001);

Review comment:
       done




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] cyb70289 commented on pull request #8269: ARROW-10070: [C++][Compute] Implement stdev aggregate kernel

Posted by GitBox <gi...@apache.org>.
cyb70289 commented on pull request #8269:
URL: https://github.com/apache/arrow/pull/8269#issuecomment-699509272


   > We might want to have an option to specify the denominator? (whether it is `n` or `n - 1`, to compute population vs sample standard deviation)
   
   Thank you, will do.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] nealrichardson commented on pull request #8269: ARROW-10070: [C++][Compute] Implement stdev aggregate kernel

Posted by GitBox <gi...@apache.org>.
nealrichardson commented on pull request #8269:
URL: https://github.com/apache/arrow/pull/8269#issuecomment-700081921


   Two notes:
   
   1. Naming: I've never seen this called `stdev` anywhere. `stddev` is common, in numpy and julia it's `std`, in R it's `sd`. Let's go with one of those. Maybe just add an extra "d"?
   2. Since `sd = sqrt(var)` (https://github.com/apache/arrow/pull/8269/files#diff-461bd7e445c2a190f1173ebdefa21002R106), would it make sense to implement variance (i.e. most of this patch), and then standard deviation as the sqrt of that? That way we get two kernels (or even three, if sqrt is exposed as a kernel too).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou closed pull request #8269: ARROW-10070: [C++][Compute] Implement var and std aggregate kernel

Posted by GitBox <gi...@apache.org>.
pitrou closed pull request #8269:
URL: https://github.com/apache/arrow/pull/8269


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] arw2019 commented on a change in pull request #8269: ARROW-10070: [C++][Compute] Implement var and std aggregate kernel

Posted by GitBox <gi...@apache.org>.
arw2019 commented on a change in pull request #8269:
URL: https://github.com/apache/arrow/pull/8269#discussion_r497268296



##########
File path: cpp/src/arrow/compute/api_aggregate.h
##########
@@ -76,6 +76,18 @@ struct ARROW_EXPORT MinMaxOptions : public FunctionOptions {
   enum Mode null_handling = SKIP;
 };
 
+/// \brief Control Delta Degrees of Freedom (ddof) of Variance and Stddev kernel
+///
+/// The divisor used in calculations is N - ddof, where N is the number of elements.
+/// By default, ddof is zero, and population variance or stddev is returned.
+struct ARROW_EXPORT VarStdOptions : public FunctionOptions {
+  explicit VarStdOptions(int ddof = 0) : ddof(ddof) {}

Review comment:
       Naming suggestion: how about `DdofOptions`?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] kszucs commented on pull request #8269: ARROW-10070: [C++][Compute] Implement stdev aggregate kernel

Posted by GitBox <gi...@apache.org>.
kszucs commented on pull request #8269:
URL: https://github.com/apache/arrow/pull/8269#issuecomment-698948177


   > missing license header?
   Yes, https://github.com/apache/arrow/pull/8273 should fix that.
   
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] github-actions[bot] commented on pull request #8269: ARROW-10070: [C++][Compute] Implement stdev aggregate kernel

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #8269:
URL: https://github.com/apache/arrow/pull/8269#issuecomment-698714057


   https://issues.apache.org/jira/browse/ARROW-10070


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] cyb70289 commented on a change in pull request #8269: ARROW-10070: [C++][Compute] Implement var and std aggregate kernel

Posted by GitBox <gi...@apache.org>.
cyb70289 commented on a change in pull request #8269:
URL: https://github.com/apache/arrow/pull/8269#discussion_r497254422



##########
File path: docs/source/cpp/compute.rst
##########
@@ -140,8 +140,12 @@ Aggregations
 +--------------------------+------------+--------------------+-----------------------+--------------------------------------------+
 | mode                     | Unary      | Numeric            | Scalar Struct  (2)    |                                            |
 +--------------------------+------------+--------------------+-----------------------+--------------------------------------------+
+| std                      | Unary      | Numeric            | Scalar Float64        | :struct:`VarStdOptions`                    |
++--------------------------+------------+--------------------+-----------------------+--------------------------------------------+
 | sum                      | Unary      | Numeric            | Scalar Numeric (3)    |                                            |
 +--------------------------+------------+--------------------+-----------------------+--------------------------------------------+
+| var                      | Unary      | Numeric            | Scalar Float64        | :struct:`VarStdOptions`                    |

Review comment:
       done

##########
File path: cpp/src/arrow/compute/kernels/aggregate_var_std.cc
##########
@@ -0,0 +1,200 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include "arrow/compute/kernels/aggregate_basic_internal.h"
+
+namespace arrow {
+namespace compute {
+namespace aggregate {
+
+namespace {
+
+// Based on https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance
+template <typename ArrowType>
+struct VarStdState {
+  using ArrayType = typename TypeTraits<ArrowType>::ArrayType;
+  using c_type = typename ArrowType::c_type;
+  using ThisType = VarStdState<ArrowType>;
+
+  // Calculate variance of one chunk with `two pass algorithm`
+  // Always use `double` to calculate variance for any array type
+  void Consume(const ArrayType& array) {
+    int64_t count = array.length() - array.null_count();
+    if (count == 0) {
+      return;
+    }
+
+    double sum = 0;
+    VisitArrayDataInline<ArrowType>(
+        *array.data(), [&sum](c_type value) { sum += static_cast<double>(value); },
+        []() {});
+
+    double mean = sum / count, m2 = 0;
+    VisitArrayDataInline<ArrowType>(
+        *array.data(),
+        [mean, &m2](c_type value) {
+          double v = static_cast<double>(value);
+          m2 += (v - mean) * (v - mean);
+        },
+        []() {});
+
+    this->count = count;
+    this->sum = sum;
+    this->m2 = m2;
+  }
+
+  // Combine variance from two chunks
+  void MergeFrom(const ThisType& state) {
+    if (state.count == 0) {
+      return;
+    }
+    if (this->count == 0) {
+      this->count = state.count;
+      this->sum = state.sum;
+      this->m2 = state.m2;
+      return;
+    }
+    double delta = this->sum / this->count - state.sum / state.count;
+    this->m2 += state.m2 +
+                delta * delta * this->count * state.count / (this->count + state.count);

Review comment:
       done

##########
File path: cpp/src/arrow/compute/kernels/aggregate_test.cc
##########
@@ -914,5 +914,127 @@ TEST_F(TestInt32ModeKernel, LargeValueRange) {
   CheckModeWithRange<ArrowType>(-10000000, 10000000);
 }
 
+//
+// Var/Std
+//
+
+template <typename ArrowType>
+class TestPrimitiveVarStdKernel : public ::testing::Test {
+ public:
+  using Traits = TypeTraits<ArrowType>;
+  using ScalarType = typename TypeTraits<DoubleType>::ScalarType;
+
+  void AssertVarStdIs(const Datum& array, const VarStdOptions& options,
+                      double expected_var, double diff = 0) {
+    ASSERT_OK_AND_ASSIGN(Datum out_var, Var(array, options));
+    ASSERT_OK_AND_ASSIGN(Datum out_std, Std(array, options));
+    auto var = checked_cast<const ScalarType*>(out_var.scalar().get());
+    auto std = checked_cast<const ScalarType*>(out_std.scalar().get());
+    ASSERT_TRUE(var->is_valid && std->is_valid);
+    ASSERT_DOUBLE_EQ(std->value * std->value, var->value);
+    if (diff == 0) {
+      ASSERT_DOUBLE_EQ(var->value, expected_var);  // < 4ULP
+    } else {
+      ASSERT_NEAR(var->value, expected_var, diff);
+    }
+  }
+
+  void AssertVarStdIs(const std::string& json, const VarStdOptions& options,
+                      double expected_var) {
+    auto array = ArrayFromJSON(type_singleton(), json);
+    AssertVarStdIs(array, options, expected_var);
+  }
+
+  void AssertVarStdIs(const std::vector<std::string>& json, const VarStdOptions& options,
+                      double expected_var) {
+    auto chunked = ChunkedArrayFromJSON(type_singleton(), json);
+    AssertVarStdIs(chunked, options, expected_var);
+  }
+
+  void AssertVarStdIsInvalid(const Datum& array, const VarStdOptions& options) {
+    ASSERT_OK_AND_ASSIGN(Datum out_var, Var(array, options));
+    ASSERT_OK_AND_ASSIGN(Datum out_std, Std(array, options));
+    auto var = checked_cast<const ScalarType*>(out_var.scalar().get());
+    auto std = checked_cast<const ScalarType*>(out_std.scalar().get());
+    ASSERT_FALSE(var->is_valid || std->is_valid);
+  }
+
+  void AssertVarStdIsInvalid(const std::string& json, const VarStdOptions& options) {
+    auto array = ArrayFromJSON(type_singleton(), json);
+    AssertVarStdIsInvalid(array, options);
+  }
+
+  std::shared_ptr<DataType> type_singleton() { return Traits::type_singleton(); }
+};
+
+template <typename ArrowType>
+class TestNumericVarStdKernel : public TestPrimitiveVarStdKernel<ArrowType> {};
+
+// Reference value from numpy.var
+TYPED_TEST_SUITE(TestNumericVarStdKernel, NumericArrowTypes);
+TYPED_TEST(TestNumericVarStdKernel, Basics) {
+  VarStdOptions options;  // ddof = 0, population var/std
+
+  this->AssertVarStdIs("[100]", options, 0);
+  this->AssertVarStdIs("[1, 2, 3]", options, 0.6666666666666666);
+  this->AssertVarStdIs("[null, 1, 2, null, 3]", options, 0.6666666666666666);
+
+  this->AssertVarStdIs({"[]", "[1]", "[2]", "[null]", "[3]"}, options,
+                       0.6666666666666666);
+  this->AssertVarStdIs({"[1, 2, 3]", "[4, 5, 6]", "[7, 8]"}, options, 5.25);
+
+  this->AssertVarStdIsInvalid("[null, null, null]", options);
+  this->AssertVarStdIsInvalid("[]", options);
+
+  options.ddof = 1;  // sample var/std
+
+  this->AssertVarStdIs("[1, 2]", options, 0.5);
+  this->AssertVarStdIs({"[1, 2, 3]", "[4, 5, 6]", "[7, 8]"}, options, 6.0);
+
+  this->AssertVarStdIsInvalid("[100]", options);
+  this->AssertVarStdIsInvalid("[100, null, null]", options);
+}
+
+class TestVarStdKernelStability : public TestPrimitiveVarStdKernel<DoubleType> {};
+
+// Test numerical stability
+TEST_F(TestVarStdKernelStability, Basics) {
+  VarStdOptions options{1};  // ddof = 1
+  this->AssertVarStdIs("[100000004, 100000007, 100000013, 100000016]", options, 30.0);
+  this->AssertVarStdIs("[1000000004, 1000000007, 1000000013, 1000000016]", options, 30.0);
+}
+
+// Calculate reference variance with welford online algorithm
+double WelfordVar(const Array& array) {
+  const auto& array_numeric = reinterpret_cast<const DoubleArray&>(array);
+  const auto values = array_numeric.raw_values();
+  internal::BitmapReader reader(array.null_bitmap_data(), array.offset(), array.length());
+  double count = 0, mean = 0, m2 = 0;
+  for (int64_t i = 0; i < array.length(); ++i) {
+    if (reader.IsSet()) {
+      ++count;
+      double delta = values[i] - mean;
+      mean += delta / count;
+      double delta2 = values[i] - mean;
+      m2 += delta * delta2;
+    }
+    reader.Next();
+  }
+  return m2 / count;
+}
+
+class TestVarStdKernelRandom : public TestPrimitiveVarStdKernel<DoubleType> {};
+
+TEST_F(TestVarStdKernelRandom, Basics) {
+  auto rand = random::RandomArrayGenerator(0x5487656);
+  auto array = rand.Numeric<DoubleType>(40000, -10000.0, 100000.0, 0.1);

Review comment:
       done




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] cyb70289 commented on pull request #8269: ARROW-10070: [C++][Compute] Implement stdev aggregate kernel

Posted by GitBox <gi...@apache.org>.
cyb70289 commented on pull request #8269:
URL: https://github.com/apache/arrow/pull/8269#issuecomment-700379771


   Thanks @nealrichardson 
   
   >     1. Naming: I've never seen this called `stdev` anywhere. `stddev` is common, in numpy and julia it's `std`, in R it's `sd`. Let's go with one of those. Maybe just add an extra "d"?
   
   Naming is always the hardest thing :) Looks `std` is used more often, and it's short.
   AFAIK, `stdev` is used in excel (the most popular statistic software I guess? :)
   
   >     2. Since `sd = sqrt(var)` (https://github.com/apache/arrow/pull/8269/files#diff-461bd7e445c2a190f1173ebdefa21002R106), would it make sense to implement variance (i.e. most of this patch), and then standard deviation as the sqrt of that? That way we get two kernels (or even three, if sqrt is exposed as a kernel too).
   
   I also thought about the `var` kernel. Will update this patch to include it.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou commented on a change in pull request #8269: ARROW-10070: [C++][Compute] Implement var and std aggregate kernel

Posted by GitBox <gi...@apache.org>.
pitrou commented on a change in pull request #8269:
URL: https://github.com/apache/arrow/pull/8269#discussion_r497408938



##########
File path: cpp/src/arrow/compute/api_aggregate.h
##########
@@ -76,6 +76,18 @@ struct ARROW_EXPORT MinMaxOptions : public FunctionOptions {
   enum Mode null_handling = SKIP;
 };
 
+/// \brief Control Delta Degrees of Freedom (ddof) of Variance and Stddev kernel
+///
+/// The divisor used in calculations is N - ddof, where N is the number of elements.
+/// By default, ddof is zero, and population variance or stddev is returned.
+struct ARROW_EXPORT VarStdOptions : public FunctionOptions {
+  explicit VarStdOptions(int ddof = 0) : ddof(ddof) {}

Review comment:
       Only one is useful. Just `VarianceOptions` IMHO.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou commented on a change in pull request #8269: ARROW-10070: [C++][Compute] Implement var and std aggregate kernel

Posted by GitBox <gi...@apache.org>.
pitrou commented on a change in pull request #8269:
URL: https://github.com/apache/arrow/pull/8269#discussion_r496687434



##########
File path: cpp/src/arrow/compute/kernels/aggregate_test.cc
##########
@@ -914,5 +914,127 @@ TEST_F(TestInt32ModeKernel, LargeValueRange) {
   CheckModeWithRange<ArrowType>(-10000000, 10000000);
 }
 
+//
+// Var/Std
+//
+
+template <typename ArrowType>
+class TestPrimitiveVarStdKernel : public ::testing::Test {
+ public:
+  using Traits = TypeTraits<ArrowType>;
+  using ScalarType = typename TypeTraits<DoubleType>::ScalarType;
+
+  void AssertVarStdIs(const Datum& array, const VarStdOptions& options,
+                      double expected_var, double diff = 0) {
+    ASSERT_OK_AND_ASSIGN(Datum out_var, Var(array, options));
+    ASSERT_OK_AND_ASSIGN(Datum out_std, Std(array, options));
+    auto var = checked_cast<const ScalarType*>(out_var.scalar().get());
+    auto std = checked_cast<const ScalarType*>(out_std.scalar().get());
+    ASSERT_TRUE(var->is_valid && std->is_valid);
+    ASSERT_DOUBLE_EQ(std->value * std->value, var->value);
+    if (diff == 0) {
+      ASSERT_DOUBLE_EQ(var->value, expected_var);  // < 4ULP
+    } else {
+      ASSERT_NEAR(var->value, expected_var, diff);
+    }
+  }
+
+  void AssertVarStdIs(const std::string& json, const VarStdOptions& options,
+                      double expected_var) {
+    auto array = ArrayFromJSON(type_singleton(), json);
+    AssertVarStdIs(array, options, expected_var);
+  }
+
+  void AssertVarStdIs(const std::vector<std::string>& json, const VarStdOptions& options,
+                      double expected_var) {
+    auto chunked = ChunkedArrayFromJSON(type_singleton(), json);
+    AssertVarStdIs(chunked, options, expected_var);
+  }
+
+  void AssertVarStdIsInvalid(const Datum& array, const VarStdOptions& options) {
+    ASSERT_OK_AND_ASSIGN(Datum out_var, Var(array, options));
+    ASSERT_OK_AND_ASSIGN(Datum out_std, Std(array, options));
+    auto var = checked_cast<const ScalarType*>(out_var.scalar().get());
+    auto std = checked_cast<const ScalarType*>(out_std.scalar().get());
+    ASSERT_FALSE(var->is_valid || std->is_valid);
+  }
+
+  void AssertVarStdIsInvalid(const std::string& json, const VarStdOptions& options) {
+    auto array = ArrayFromJSON(type_singleton(), json);
+    AssertVarStdIsInvalid(array, options);
+  }
+
+  std::shared_ptr<DataType> type_singleton() { return Traits::type_singleton(); }
+};
+
+template <typename ArrowType>
+class TestNumericVarStdKernel : public TestPrimitiveVarStdKernel<ArrowType> {};
+
+// Reference value from numpy.var
+TYPED_TEST_SUITE(TestNumericVarStdKernel, NumericArrowTypes);
+TYPED_TEST(TestNumericVarStdKernel, Basics) {
+  VarStdOptions options;  // ddof = 0, population var/std
+
+  this->AssertVarStdIs("[100]", options, 0);
+  this->AssertVarStdIs("[1, 2, 3]", options, 0.6666666666666666);
+  this->AssertVarStdIs("[null, 1, 2, null, 3]", options, 0.6666666666666666);
+
+  this->AssertVarStdIs({"[]", "[1]", "[2]", "[null]", "[3]"}, options,
+                       0.6666666666666666);
+  this->AssertVarStdIs({"[1, 2, 3]", "[4, 5, 6]", "[7, 8]"}, options, 5.25);
+
+  this->AssertVarStdIsInvalid("[null, null, null]", options);
+  this->AssertVarStdIsInvalid("[]", options);
+
+  options.ddof = 1;  // sample var/std
+
+  this->AssertVarStdIs("[1, 2]", options, 0.5);
+  this->AssertVarStdIs({"[1, 2, 3]", "[4, 5, 6]", "[7, 8]"}, options, 6.0);
+
+  this->AssertVarStdIsInvalid("[100]", options);
+  this->AssertVarStdIsInvalid("[100, null, null]", options);
+}
+
+class TestVarStdKernelStability : public TestPrimitiveVarStdKernel<DoubleType> {};
+
+// Test numerical stability
+TEST_F(TestVarStdKernelStability, Basics) {
+  VarStdOptions options{1};  // ddof = 1
+  this->AssertVarStdIs("[100000004, 100000007, 100000013, 100000016]", options, 30.0);
+  this->AssertVarStdIs("[1000000004, 1000000007, 1000000013, 1000000016]", options, 30.0);
+}
+
+// Calculate reference variance with welford online algorithm
+double WelfordVar(const Array& array) {
+  const auto& array_numeric = reinterpret_cast<const DoubleArray&>(array);
+  const auto values = array_numeric.raw_values();
+  internal::BitmapReader reader(array.null_bitmap_data(), array.offset(), array.length());
+  double count = 0, mean = 0, m2 = 0;
+  for (int64_t i = 0; i < array.length(); ++i) {
+    if (reader.IsSet()) {
+      ++count;
+      double delta = values[i] - mean;
+      mean += delta / count;
+      double delta2 = values[i] - mean;
+      m2 += delta * delta2;
+    }
+    reader.Next();
+  }
+  return m2 / count;
+}
+
+class TestVarStdKernelRandom : public TestPrimitiveVarStdKernel<DoubleType> {};
+
+TEST_F(TestVarStdKernelRandom, Basics) {
+  auto rand = random::RandomArrayGenerator(0x5487656);
+  auto array = rand.Numeric<DoubleType>(40000, -10000.0, 100000.0, 0.1);
+  auto chunked = *ChunkedArray::Make(
+      {array->Slice(0, 1000), array->Slice(1000, 9000), array->Slice(10000, 30000)});
+  double expected_var = WelfordVar(*array);
+
+  VarStdOptions options;
+  this->AssertVarStdIs(chunked, options, expected_var, 0.0001);

Review comment:
       Can we also test with `ddof = 1`? Though with such a large array size, the change may be rather small.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] cyb70289 commented on a change in pull request #8269: ARROW-10070: [C++][Compute] Implement var and std aggregate kernel

Posted by GitBox <gi...@apache.org>.
cyb70289 commented on a change in pull request #8269:
URL: https://github.com/apache/arrow/pull/8269#discussion_r497288932



##########
File path: cpp/src/arrow/compute/api_aggregate.h
##########
@@ -76,6 +76,18 @@ struct ARROW_EXPORT MinMaxOptions : public FunctionOptions {
   enum Mode null_handling = SKIP;
 };
 
+/// \brief Control Delta Degrees of Freedom (ddof) of Variance and Stddev kernel
+///
+/// The divisor used in calculations is N - ddof, where N is the number of elements.
+/// By default, ddof is zero, and population variance or stddev is returned.
+struct ARROW_EXPORT VarStdOptions : public FunctionOptions {
+  explicit VarStdOptions(int ddof = 0) : ddof(ddof) {}

Review comment:
       Kind of struggling as stddev kernel uses this same option.
   Maybe export both `VarianceOptions` and `StddevOptions` and alias them as same type internally?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorisvandenbossche commented on pull request #8269: ARROW-10070: [C++][Compute] Implement var and std aggregate kernel

Posted by GitBox <gi...@apache.org>.
jorisvandenbossche commented on pull request #8269:
URL: https://github.com/apache/arrow/pull/8269#issuecomment-701343374


   I suppose the behaviour with NaN is that any NaN in the input gives NaN as result? That might be worth adding a test for?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] cyb70289 edited a comment on pull request #8269: ARROW-10070: [C++][Compute] Implement stdev aggregate kernel

Posted by GitBox <gi...@apache.org>.
cyb70289 edited a comment on pull request #8269:
URL: https://github.com/apache/arrow/pull/8269#issuecomment-698754806


   Dev / Lint ci failure looks not related.
   ```
   INFO:archery:Running Docker linter
   apache-rat license violation: go/arrow/flight/Flight_grpc.pb.go
   apache-rat license violation: go/arrow/flight/example_flight_server_test.go
   Error: `docker-compose --file /home/runner/work/arrow/arrow/docker-compose.yml run --rm ubuntu-lint` exited with a non-zero exit code 1, see the process log above.
   ```
   missing license header?
   Will it related to pr https://github.com/apache/arrow/pull/8175? @zeroshade  @wesm 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] cyb70289 commented on pull request #8269: ARROW-10070: [C++][Compute] Implement stdev aggregate kernel

Posted by GitBox <gi...@apache.org>.
cyb70289 commented on pull request #8269:
URL: https://github.com/apache/arrow/pull/8269#issuecomment-699582663


   Added option `ddof` to control stdev divisor. Same as [numpy.std](https://numpy.org/doc/stable/reference/generated/numpy.std.html).
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] cyb70289 commented on pull request #8269: ARROW-10070: [C++][Compute] Implement stdev aggregate kernel

Posted by GitBox <gi...@apache.org>.
cyb70289 commented on pull request #8269:
URL: https://github.com/apache/arrow/pull/8269#issuecomment-698754806


   Dev / Lint ci failure looks not related.
   ```
   INFO:archery:Running Docker linter
   apache-rat license violation: go/arrow/flight/Flight_grpc.pb.go
   apache-rat license violation: go/arrow/flight/example_flight_server_test.go
   Error: `docker-compose --file /home/runner/work/arrow/arrow/docker-compose.yml run --rm ubuntu-lint` exited with a non-zero exit code 1, see the process log above.
   ```
   missing license header?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org