You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/05/07 15:35:45 UTC
[GitHub] [arrow] nealrichardson commented on a change in pull request #10190: ARROW-11515: [R] Bindings for strsplit

nealrichardson commented on a change in pull request #10190:
URL: https://github.com/apache/arrow/pull/10190#discussion_r628314240



##########
File path: r/tests/testthat/test-dplyr-string-functions.R
##########
@@ -239,7 +254,172 @@ test_that("str_replace and str_replace_all", {
 
 })
 
-test_that("backreferences in pattern", {
+test_that("strsplit and str_split", {
+
+  df <- tibble(x = c("Foo and bar", "baz and qux and quux"))
+
+  expect_dplyr_equal(
+    input %>%
+      mutate(x = strsplit(x, "and")) %>%
+      collect(),
+    df
+  )
+  expect_dplyr_equal(
+    input %>%
+      mutate(x = strsplit(x, "and.*", fixed = TRUE)) %>%
+      collect(),
+    df
+  )
+  expect_dplyr_equal(
+    input %>%
+      mutate(x = str_split(x, "and")) %>%
+      collect(),
+    df
+  )
+  expect_dplyr_equal(
+    input %>%
+      mutate(x = str_split(x, "and", n = 2)) %>%
+      collect(),
+    df
+  )
+  expect_dplyr_equal(
+    input %>%
+      mutate(x = str_split(x, fixed("and"), n = 2)) %>%
+      collect(),
+    df
+  )
+  expect_dplyr_equal(
+    input %>%
+      mutate(x = str_split(x, regex("and"), n = 2)) %>%
+      collect(),
+    df
+  )
+
+})
+
+test_that("arrow_*_split_whitespace functions", {
+
+  # use only ASCII whitespace characters
+  df_ascii <- tibble(x = c("Foo\nand bar", "baz\tand qux and quux"))
+
+  # use only non-ASCII whitespace characters
+  df_utf8 <- tibble(x = c("Foo\u00A0and\u2000bar", "baz\u2006and\u1680qux\u3000and\u2008quux"))
+
+  df_split <- tibble(x = list(c("Foo", "and", "bar"), c("baz", "and", "qux", "and", "quux")))
+
+  expect_equivalent(
+    df_ascii %>%
+      Table$create() %>%
+      mutate(x = arrow_ascii_split_whitespace(x)) %>%
+      collect(),
+    df_split
+  )
+  expect_equivalent(
+    df_utf8 %>%
+      Table$create() %>%
+      mutate(x = arrow_utf8_split_whitespace(x)) %>%
+      collect(),
+    df_split
+  )
+
+})
+
+test_that("errors and warnings in string splitting", {
+  df <- tibble(x = c("Foo and bar", "baz and qux and quux"))
+
+  # These conditions generate an error, but abandon_ship() catches the error,
+  # issues a warning, and pulls the data into R
+  expect_warning(
+    df %>%
+      Table$create() %>%
+      mutate(x = strsplit(x, "and.*", fixed = FALSE)) %>%
+      collect(),
+    regexp = "not supported"

Review comment:
       "not supported" does not uniquely tell you which validation is catching this (assuming you're testing different inputs to test each of the validations you added). Since you're not asserting anything about the actual output, and we're testing elsewhere that `abandon_ship`, and `mutate` for that matter, work, maybe these should be more like unit tests. 
   
   Something like: 
   
   ```r
   x <- Expression$field_ref("x")
   expect_warning(
     dplyr_functions$dataset$strsplit(x, "and.*", fixed = FALSE),
     "The full message"
   )
   ```

##########
File path: r/R/dplyr.R
##########
@@ -539,6 +541,44 @@ arrow_stringr_string_replace_function <- function(FUN, max_replacements) {
   }
 }
 
+arrow_r_string_split_function <- function(FUN, reverse = FALSE, max_splits = -1) {
+  function(x, split, fixed = FALSE, perl = FALSE, useBytes = FALSE) {
+    
+    assert_that(is.string(split))
+    
+    # if !fixed but no regex metachars in split pattern, allow to proceed as split isn't regex
+    if (!fixed && contains_regex(split)) {
+      stop("Regular expression matching not supported in strsplit for Arrow", call. = FALSE)
+    }
+    if (fixed && perl) {
+      warning("Argument 'perl = TRUE' will be ignored", call. = FALSE)
+    }
+    FUN("split_pattern", x, options = list(pattern = split, reverse = reverse, max_splits = max_splits))
+  }
+}
+
+arrow_stringr_string_split_function <- function(FUN, reverse = FALSE) {
+  function(string, pattern, n = Inf, simplify = FALSE) {
+    opts <- get_stringr_pattern_options(enexpr(pattern))
+    if (!opts$fixed && contains_regex(opts$pattern)) {
+      stop("Regular expression matching not supported in str_split() for Arrow", call. = FALSE)
+    }
+    if (opts$ignore_case) {
+      stop("Case-insensitive string splitting not supported in Arrow", call. = FALSE)
+    }
+    if (n == 0) {
+      stop("Splitting strings into zero parts not supported in Arrow" , call. = FALSE)
+    }
+    if (identical(n, Inf)) {
+      n <- 0L
+    }
+    if (simplify) {
+      warning("Argument 'simplify = TRUE' will be ignored", call. = FALSE)
+    }
+    FUN("split_pattern", string, options = list(pattern = opts$pattern, reverse = reverse, max_splits = n - 1L))

Review comment:
       Are we sure that this max_splits should be -1? Like, if I want max 2 splits, I should provide `max_splits = 1`? That sounds quite wrong.
   
   Can you point me at the tests for this behavior? (I don't think there are tests of this from R in this PR, but there should be C++ tests.)

##########
File path: r/R/dplyr.R
##########
@@ -539,6 +541,44 @@ arrow_stringr_string_replace_function <- function(FUN, max_replacements) {
   }
 }
 
+arrow_r_string_split_function <- function(FUN, reverse = FALSE, max_splits = -1) {
+  function(x, split, fixed = FALSE, perl = FALSE, useBytes = FALSE) {
+    
+    assert_that(is.string(split))
+    
+    # if !fixed but no regex metachars in split pattern, allow to proceed as split isn't regex
+    if (!fixed && contains_regex(split)) {
+      stop("Regular expression matching not supported in strsplit for Arrow", call. = FALSE)
+    }
+    if (fixed && perl) {

Review comment:
       Do we only ignore the `perl` argument if `fixed = TRUE`? 

##########
File path: r/tests/testthat/test-dplyr-string-functions.R
##########
@@ -239,7 +254,172 @@ test_that("str_replace and str_replace_all", {
 
 })
 
-test_that("backreferences in pattern", {
+test_that("strsplit and str_split", {
+
+  df <- tibble(x = c("Foo and bar", "baz and qux and quux"))
+
+  expect_dplyr_equal(
+    input %>%
+      mutate(x = strsplit(x, "and")) %>%
+      collect(),
+    df
+  )
+  expect_dplyr_equal(
+    input %>%
+      mutate(x = strsplit(x, "and.*", fixed = TRUE)) %>%
+      collect(),
+    df
+  )
+  expect_dplyr_equal(
+    input %>%
+      mutate(x = str_split(x, "and")) %>%
+      collect(),
+    df
+  )
+  expect_dplyr_equal(
+    input %>%
+      mutate(x = str_split(x, "and", n = 2)) %>%
+      collect(),
+    df
+  )
+  expect_dplyr_equal(
+    input %>%
+      mutate(x = str_split(x, fixed("and"), n = 2)) %>%
+      collect(),
+    df
+  )
+  expect_dplyr_equal(
+    input %>%
+      mutate(x = str_split(x, regex("and"), n = 2)) %>%
+      collect(),
+    df
+  )
+
+})
+
+test_that("arrow_*_split_whitespace functions", {
+
+  # use only ASCII whitespace characters
+  df_ascii <- tibble(x = c("Foo\nand bar", "baz\tand qux and quux"))
+
+  # use only non-ASCII whitespace characters
+  df_utf8 <- tibble(x = c("Foo\u00A0and\u2000bar", "baz\u2006and\u1680qux\u3000and\u2008quux"))
+
+  df_split <- tibble(x = list(c("Foo", "and", "bar"), c("baz", "and", "qux", "and", "quux")))
+
+  expect_equivalent(
+    df_ascii %>%
+      Table$create() %>%
+      mutate(x = arrow_ascii_split_whitespace(x)) %>%

Review comment:
       I think the only reason you'd want to test these functions in the R package is to test that you're passing the extra C++ options through (that you've wired that up in compute.cpp correctly). So let's test them.

##########
File path: r/src/compute.cpp
##########
@@ -233,6 +233,33 @@ std::shared_ptr<arrow::compute::FunctionOptions> make_compute_options(
                                      max_replacements);
   }
 
+  if (func_name == "split_pattern") {
+    using Options = arrow::compute::SplitPatternOptions;
+    int64_t max_splits = -1;

Review comment:
       Rather than setting defaults manually, can you do `auto out = std::make_shared<Options>(Options::Defaults());` and then just set them if they're provided? (We do this above for other types.)




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org