Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/04/30 16:11:06 UTC

[GitHub] [arrow] nealrichardson commented on a change in pull request #10215: ARROW-12199: [R] bindings for stddev, variance

nealrichardson commented on a change in pull request #10215:
URL: https://github.com/apache/arrow/pull/10215#discussion_r623993426



##########
File path: r/R/compute.R
##########
@@ -267,6 +267,25 @@ value_counts <- function(x) {
   call_function("value_counts", x)
 }
 
+
+#' `variance` and `stddev` for Arrow objects

Review comment:
       I'm not sure about this. Unfortunately, `sd()` and `var()` aren't generics, so we can't just define methods for them, and it might not be worth adding these wrappers at all.

##########
File path: r/src/compute.cpp
##########
@@ -232,7 +232,12 @@ std::shared_ptr<arrow::compute::FunctionOptions> make_compute_options(
                                      cpp11::as_cpp<std::string>(options["replacement"]),
                                      max_replacements);
   }
-
+  
+  if (func_name == "variance" || func_name == "stddev") {

Review comment:
       TBH this is probably the only code addition we want to keep here.

##########
File path: r/R/compute.R
##########
@@ -267,6 +267,25 @@ value_counts <- function(x) {
   call_function("value_counts", x)
 }
 
+
+#' `variance` and `stddev` for Arrow objects
+#'
+#' These functions calculate the variance and standard deviation of Arrow arrays
+#' @param x `Array` or `ChunkedArray`
+#' @param ddof The divisor used in calculations is N - ddof, where N is the number of elements. 
+#' By default, ddof is zero, and population variance or stddev is returned. 
+#' @return A `Scalar` containing the calculated value.
+#' @export
+stddev <- function(x, ddof = 0) {
+  call_function("stddev", x, options = list(ddof = ddof))

Review comment:
       Is there no `na.rm` handling in the Arrow stddev and variance functions? If not, there should be (please open a JIRA for it).
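To make the request concrete, here is a minimal, self-contained C++ sketch of the semantics in question: a variance whose divisor is N - ddof, with an explicit switch for skipping nulls (the analogue of `na.rm`). The names `variance_sketch`, `stddev_sketch`, and `skip_nulls` are hypothetical and are not Arrow's API; nulls are modelled with `std::optional`, and this is not a claim about how the Arrow kernels are actually implemented.

```cpp
#include <cmath>
#include <cstddef>
#include <optional>
#include <vector>

// Hypothetical helper (not Arrow's API): variance with divisor (N - ddof),
// where N counts only the non-null values.
// skip_nulls = true  behaves like na.rm = TRUE in R: nulls are ignored.
// skip_nulls = false behaves like na.rm = FALSE: any null makes the result null.
std::optional<double> variance_sketch(
    const std::vector<std::optional<double>>& values, int ddof = 0,
    bool skip_nulls = true) {
  double sum = 0.0;
  std::size_t n = 0;
  for (const auto& v : values) {
    if (!v.has_value()) {
      if (!skip_nulls) return std::nullopt;  // a null poisons the result
      continue;
    }
    sum += *v;
    ++n;
  }
  // Not enough non-null values for the requested divisor.
  if (static_cast<long long>(n) <= static_cast<long long>(ddof)) {
    return std::nullopt;
  }
  const double mean = sum / static_cast<double>(n);
  double sum_sq_dev = 0.0;
  for (const auto& v : values) {
    if (v.has_value()) {
      const double d = *v - mean;
      sum_sq_dev += d * d;
    }
  }
  return sum_sq_dev / (static_cast<double>(n) - static_cast<double>(ddof));
}

// Hypothetical helper: standard deviation as the square root of the variance.
std::optional<double> stddev_sketch(
    const std::vector<std::optional<double>>& values, int ddof = 0,
    bool skip_nulls = true) {
  const auto var = variance_sketch(values, ddof, skip_nulls);
  if (!var.has_value()) return std::nullopt;
  return std::sqrt(*var);
}
```

Note that base R's `sd()` and `var()` divide by N - 1, which corresponds to `ddof = 1` here, whereas the wrapper in this PR defaults to `ddof = 0` (the population value).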

##########
File path: r/R/dplyr.R
##########
@@ -480,6 +480,18 @@ build_function_list <- function(FUN) {
     between = function(x, left, right) {
       x >= left & x <= right
     },
+    sd = function(x, na.rm = FALSE){
+      if (!na.rm && x$null_count > 0) {
+        return(Scalar$create(NA_real_))
+      }

Review comment:
       We don't support aggregations in our dplyr backend yet, so this should never succeed. If `sd()` doesn't always error cleanly when called on an Arrow Expression, we should force it to; see the "fail" handling inside `arrow_eval`, where this is done for `mean`.



