You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/06/18 21:14:29 UTC

[GitHub] [arrow] lidavidm commented on a change in pull request #10547: ARROW-11514: [R] Bindings for paste(), paste0(), str_c()

lidavidm commented on a change in pull request #10547:
URL: https://github.com/apache/arrow/pull/10547#discussion_r654502869



##########
File path: r/R/dplyr-functions.R
##########
@@ -215,6 +215,54 @@ nse_funcs$nchar <- function(x, type = "chars", allowNA = FALSE, keepNA = NA) {
   }
 }
 
+nse_funcs$paste <- function(..., sep = " ", collapse = NULL, recycle0 = FALSE) {
+  assert_that(is.null(collapse))
+  arrow_string_join_function(NullHandlingBehavior$REPLACE, "NA")(..., sep)
+}
+
+nse_funcs$paste0 <- function(..., collapse = NULL, recycle0 = FALSE) {
+  assert_that(is.null(collapse))
+  arrow_string_join_function(NullHandlingBehavior$REPLACE, "NA")(..., "")
+}
+
+nse_funcs$str_c <- function(..., sep = "", collapse = NULL) {
+  assert_that(is.null(collapse))
+  arrow_string_join_function(NullHandlingBehavior$EMIT_NULL)(..., sep)
+}
+
+arrow_string_join_function <- function(null_handling, null_replacement = NULL) {
+  # the `binary_join_element_wise` Arrow C++ compute kernel takes the separator
+  # as the last argument, so pass `sep` as the last dots arg to this function
+  function(...) {
+    args <- lapply(list(...), function(arg) {
+      # handle scalar literal args, and cast all args to string for
+      # consistency with base::paste(), base::paste0(), and stringr::str_c()
+      if (!inherits(arg, "Expression")) {
+        assert_that(
+          length(arg) == 1,
+          msg = "Literal vectors of length != 1 not supported in string concatenation"
+        )
+        # handle scalar literal NA consistent with the binary_join_element_wise
+        # kernel's handling of nulls in the data
+        if (null_handling == NullHandlingBehavior$REPLACE && is.na(arg)) {
+          arg <- null_replacement
+        }

Review comment:
       I need to dig more but it looks like at some point, this gets converted into a dataset scan of an in-memory dataset, where we project a column called `paste0(x, NA_character_, z)` to `null` (so something else made it null before it ever got evaluated).

##########
File path: r/R/dplyr-functions.R
##########
@@ -215,6 +215,54 @@ nse_funcs$nchar <- function(x, type = "chars", allowNA = FALSE, keepNA = NA) {
   }
 }
 
+nse_funcs$paste <- function(..., sep = " ", collapse = NULL, recycle0 = FALSE) {
+  assert_that(is.null(collapse))
+  arrow_string_join_function(NullHandlingBehavior$REPLACE, "NA")(..., sep)
+}
+
+nse_funcs$paste0 <- function(..., collapse = NULL, recycle0 = FALSE) {
+  assert_that(is.null(collapse))
+  arrow_string_join_function(NullHandlingBehavior$REPLACE, "NA")(..., "")
+}
+
+nse_funcs$str_c <- function(..., sep = "", collapse = NULL) {
+  assert_that(is.null(collapse))
+  arrow_string_join_function(NullHandlingBehavior$EMIT_NULL)(..., sep)
+}
+
+arrow_string_join_function <- function(null_handling, null_replacement = NULL) {
+  # the `binary_join_element_wise` Arrow C++ compute kernel takes the separator
+  # as the last argument, so pass `sep` as the last dots arg to this function
+  function(...) {
+    args <- lapply(list(...), function(arg) {
+      # handle scalar literal args, and cast all args to string for
+      # consistency with base::paste(), base::paste0(), and stringr::str_c()
+      if (!inherits(arg, "Expression")) {
+        assert_that(
+          length(arg) == 1,
+          msg = "Literal vectors of length != 1 not supported in string concatenation"
+        )
+        # handle scalar literal NA consistent with the binary_join_element_wise
+        # kernel's handling of nulls in the data
+        if (null_handling == NullHandlingBehavior$REPLACE && is.na(arg)) {
+          arg <- null_replacement
+        }

Review comment:
       Ah, it's because we simplify the projection expression and in FoldConstants, we check for functions that handle nulls with INTERSECTION (i.e. any null inputs map to a null output). It looks like I incorrectly marked this kernel so it's simplifying this expression to null before evaluating.

##########
File path: r/R/dplyr-functions.R
##########
@@ -215,6 +215,54 @@ nse_funcs$nchar <- function(x, type = "chars", allowNA = FALSE, keepNA = NA) {
   }
 }
 
+nse_funcs$paste <- function(..., sep = " ", collapse = NULL, recycle0 = FALSE) {
+  assert_that(is.null(collapse))
+  arrow_string_join_function(NullHandlingBehavior$REPLACE, "NA")(..., sep)
+}
+
+nse_funcs$paste0 <- function(..., collapse = NULL, recycle0 = FALSE) {
+  assert_that(is.null(collapse))
+  arrow_string_join_function(NullHandlingBehavior$REPLACE, "NA")(..., "")
+}
+
+nse_funcs$str_c <- function(..., sep = "", collapse = NULL) {
+  assert_that(is.null(collapse))
+  arrow_string_join_function(NullHandlingBehavior$EMIT_NULL)(..., sep)
+}
+
+arrow_string_join_function <- function(null_handling, null_replacement = NULL) {
+  # the `binary_join_element_wise` Arrow C++ compute kernel takes the separator
+  # as the last argument, so pass `sep` as the last dots arg to this function
+  function(...) {
+    args <- lapply(list(...), function(arg) {
+      # handle scalar literal args, and cast all args to string for
+      # consistency with base::paste(), base::paste0(), and stringr::str_c()
+      if (!inherits(arg, "Expression")) {
+        assert_that(
+          length(arg) == 1,
+          msg = "Literal vectors of length != 1 not supported in string concatenation"
+        )
+        # handle scalar literal NA consistent with the binary_join_element_wise
+        # kernel's handling of nulls in the data
+        if (null_handling == NullHandlingBehavior$REPLACE && is.na(arg)) {
+          arg <- null_replacement
+        }

Review comment:
       If you want to apply this patch, that should fix things:
   
   ```diff
   diff --git a/cpp/src/arrow/compute/kernels/scalar_string.cc b/cpp/src/arrow/compute/kernels/scalar_string.cc
   index 3f63bf2c4..7fe1fe092 100644
   --- a/cpp/src/arrow/compute/kernels/scalar_string.cc
   +++ b/cpp/src/arrow/compute/kernels/scalar_string.cc
   @@ -3587,10 +3587,14 @@ void AddBinaryJoin(FunctionRegistry* registry) {
            "binary_join_element_wise", Arity::VarArgs(/*min_args=*/1),
            &binary_join_element_wise_doc, &kDefaultJoinOptions);
        for (const auto& ty : BaseBinaryTypes()) {
   -      DCHECK_OK(
   -          func->AddKernel({InputType(ty)}, ty,
   +      ScalarKernel kernel{KernelSignature::Make({InputType(ty)}, ty, /*is_varargs=*/true),
                              GenerateTypeAgnosticVarBinaryBase<BinaryJoinElementWise>(ty),
   -                          BinaryJoinElementWiseState::Init));
   +                          BinaryJoinElementWiseState::Init};
   +      // This is redundant but expression simplification uses this to potentially replace
   +      // calls with null
   +      kernel.null_handling = NullHandling::COMPUTED_NO_PREALLOCATE;
   +      kernel.mem_allocation = MemAllocation::NO_PREALLOCATE;
   +      DCHECK_OK(func->AddKernel(std::move(kernel)));
        }
        DCHECK_OK(registry->AddFunction(std::move(func)));
      }
   diff --git a/r/R/dplyr-functions.R b/r/R/dplyr-functions.R
   index 78f83ef9d..bb21ee3f1 100644
   --- a/r/R/dplyr-functions.R
   +++ b/r/R/dplyr-functions.R
   @@ -242,11 +242,6 @@ arrow_string_join_function <- function(null_handling, null_replacement = NULL) {
              length(arg) == 1,
              msg = "Literal vectors of length != 1 not supported in string concatenation"
            )
   -        # handle scalar literal NA consistent with the binary_join_element_wise
   -        # kernel's handling of nulls in the data
   -        if (null_handling == NullHandlingBehavior$REPLACE && is.na(arg)) {
   -          arg <- null_replacement
   -        }
            Expression$scalar(as.character(arg))
          } else {
            nse_funcs$as.character(arg)
   ```

##########
File path: r/R/dplyr-functions.R
##########
@@ -215,6 +215,54 @@ nse_funcs$nchar <- function(x, type = "chars", allowNA = FALSE, keepNA = NA) {
   }
 }
 
+nse_funcs$paste <- function(..., sep = " ", collapse = NULL, recycle0 = FALSE) {
+  assert_that(is.null(collapse))
+  arrow_string_join_function(NullHandlingBehavior$REPLACE, "NA")(..., sep)
+}
+
+nse_funcs$paste0 <- function(..., collapse = NULL, recycle0 = FALSE) {
+  assert_that(is.null(collapse))
+  arrow_string_join_function(NullHandlingBehavior$REPLACE, "NA")(..., "")
+}
+
+nse_funcs$str_c <- function(..., sep = "", collapse = NULL) {
+  assert_that(is.null(collapse))
+  arrow_string_join_function(NullHandlingBehavior$EMIT_NULL)(..., sep)
+}
+
+arrow_string_join_function <- function(null_handling, null_replacement = NULL) {
+  # the `binary_join_element_wise` Arrow C++ compute kernel takes the separator
+  # as the last argument, so pass `sep` as the last dots arg to this function
+  function(...) {
+    args <- lapply(list(...), function(arg) {
+      # handle scalar literal args, and cast all args to string for
+      # consistency with base::paste(), base::paste0(), and stringr::str_c()
+      if (!inherits(arg, "Expression")) {
+        assert_that(
+          length(arg) == 1,
+          msg = "Literal vectors of length != 1 not supported in string concatenation"
+        )
+        # handle scalar literal NA consistent with the binary_join_element_wise
+        # kernel's handling of nulls in the data
+        if (null_handling == NullHandlingBehavior$REPLACE && is.na(arg)) {
+          arg <- null_replacement
+        }

Review comment:
       Hmm. I got this:
   
   ```
   $ R -e 'library(arrow); library(dplyr); Table$create(x = "a", z = "c") %>% transmute(paste0(x, NA_character_, z)) %>% collect()'
   
   R version 4.0.3 (2020-10-10) -- "Bunny-Wunnies Freak Out"
   Copyright (C) 2020 The R Foundation for Statistical Computing
   Platform: x86_64-conda-linux-gnu (64-bit)
   <SNIP>
   > library(arrow); library(dplyr); Table$create(x = "a", z = "c") %>% transmute(paste0(x, NA_character_, z)) %>% collect()
   <SNIP>
   # A tibble: 1 x 1
     `paste0(x, NA_character_, z)`
     <chr>                        
   1 aNAc                         
   > 
   ```
   which is expected, right?
   
   Just to make sure, was Arrow-C++ and the R package fully rebuilt? Sometimes I find I have to `git clean -fdx .` before things get picked up.

##########
File path: r/R/dplyr-functions.R
##########
@@ -215,6 +215,54 @@ nse_funcs$nchar <- function(x, type = "chars", allowNA = FALSE, keepNA = NA) {
   }
 }
 
+nse_funcs$paste <- function(..., sep = " ", collapse = NULL, recycle0 = FALSE) {
+  assert_that(is.null(collapse))
+  arrow_string_join_function(NullHandlingBehavior$REPLACE, "NA")(..., sep)
+}
+
+nse_funcs$paste0 <- function(..., collapse = NULL, recycle0 = FALSE) {
+  assert_that(is.null(collapse))
+  arrow_string_join_function(NullHandlingBehavior$REPLACE, "NA")(..., "")
+}
+
+nse_funcs$str_c <- function(..., sep = "", collapse = NULL) {
+  assert_that(is.null(collapse))
+  arrow_string_join_function(NullHandlingBehavior$EMIT_NULL)(..., sep)
+}
+
+arrow_string_join_function <- function(null_handling, null_replacement = NULL) {
+  # the `binary_join_element_wise` Arrow C++ compute kernel takes the separator
+  # as the last argument, so pass `sep` as the last dots arg to this function
+  function(...) {
+    args <- lapply(list(...), function(arg) {
+      # handle scalar literal args, and cast all args to string for
+      # consistency with base::paste(), base::paste0(), and stringr::str_c()
+      if (!inherits(arg, "Expression")) {
+        assert_that(
+          length(arg) == 1,
+          msg = "Literal vectors of length != 1 not supported in string concatenation"
+        )
+        # handle scalar literal NA consistent with the binary_join_element_wise
+        # kernel's handling of nulls in the data
+        if (null_handling == NullHandlingBehavior$REPLACE && is.na(arg)) {
+          arg <- null_replacement
+        }

Review comment:
       Yes, sorry, I worded that comment badly - probably no comment needs to be there (as noted, it always allocates its own null bitmaps, it just wasn't correctly flagging that before, violating its contract).

##########
File path: cpp/src/arrow/compute/kernels/scalar_string.cc
##########
@@ -3587,10 +3587,14 @@ void AddBinaryJoin(FunctionRegistry* registry) {
         "binary_join_element_wise", Arity::VarArgs(/*min_args=*/1),
         &binary_join_element_wise_doc, &kDefaultJoinOptions);
     for (const auto& ty : BaseBinaryTypes()) {
-      DCHECK_OK(
-          func->AddKernel({InputType(ty)}, ty,
+      ScalarKernel kernel{KernelSignature::Make({InputType(ty)}, ty, /*is_varargs=*/true),
                           GenerateTypeAgnosticVarBinaryBase<BinaryJoinElementWise>(ty),
-                          BinaryJoinElementWiseState::Init));
+                          BinaryJoinElementWiseState::Init};
+      // This is redundant but expression simplification uses this to potentially replace
+      // calls with null

Review comment:
       ```suggestion
   ```

##########
File path: cpp/src/arrow/compute/kernels/scalar_string.cc
##########
@@ -3587,10 +3587,14 @@ void AddBinaryJoin(FunctionRegistry* registry) {
         "binary_join_element_wise", Arity::VarArgs(/*min_args=*/1),
         &binary_join_element_wise_doc, &kDefaultJoinOptions);
     for (const auto& ty : BaseBinaryTypes()) {
-      DCHECK_OK(
-          func->AddKernel({InputType(ty)}, ty,
+      ScalarKernel kernel{KernelSignature::Make({InputType(ty)}, ty, /*is_varargs=*/true),
                           GenerateTypeAgnosticVarBinaryBase<BinaryJoinElementWise>(ty),
-                          BinaryJoinElementWiseState::Init));
+                          BinaryJoinElementWiseState::Init};
+      // This is redundant but expression simplification uses this to potentially replace
+      // calls with null

Review comment:
       Sorry, my bad - let's not leave this misleading comment here




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org