You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/03/01 00:41:53 UTC

[GitHub] [arrow] jonkeane commented on a change in pull request #12433: ARROW-14808 [R] Implement bindings for `lubridate::date()`

jonkeane commented on a change in pull request #12433:
URL: https://github.com/apache/arrow/pull/12433#discussion_r816352943



##########
File path: r/tests/testthat/test-dplyr-funcs-datetime.R
##########
@@ -801,3 +801,94 @@ test_that("dst extracts daylight savings time correctly", {
     test_df
   )
 })
+
+test_that("date works in arrow", {
+  # https://issues.apache.org/jira/browse/ARROW-13168
+  skip_on_os("windows")
+  # this date is specific since lubridate::date() is different from base::as.Date()
+  # since as.Date returns the UTC date and date() doesn't
+  test_df <- tibble(
+    a = as.POSIXct(c("2012-03-26 23:12:13", NA), tz = "America/New_York"))
+  r_date_object <- lubridate::ymd_hms("2012-03-26 23:12:13")
+
+  # we can't (for now) use namespacing, so we need to make sure lubridate::date()
+  # and not base::date() is being used
+  # TODO remove once https://issues.apache.org/jira/browse/ARROW-14575 is done

Review comment:
       We also might want to add that this is only due to the way that testthat runs – that any normal use of arrow will not have to do this explicitly.

##########
File path: r/R/dplyr-funcs-type.R
##########
@@ -76,6 +76,64 @@ register_bindings_type_cast <- function() {
   register_binding("as.numeric", function(x) {
     Expression$create("cast", x, options = cast_options(to_type = float64()))
   })
+  register_binding("as.Date", function(x,
+                                       format = NULL,
+                                       tryFormats = "%Y-%m-%d",
+                                       origin = "1970-01-01",
+                                       tz = "UTC") {
+
+    if (call_binding("is.Date", x)) {
+      # base::as.Date() first converts to the desired timezone and then extracts
+      # the date, which is why we need to go through timestamp() first
+      return(x)
+
+    # cast from POSIXct
+    } else if (call_binding("is.POSIXct", x)) {
+      if (tz == "UTC") {
+        interim_x <- build_expr("cast", x, options = cast_options(to_type = timestamp(timezone = tz)))
+      } else {
+        abort("`as.Date()` with a timezone different to 'UTC' is not supported in Arrow")
+      }
+
+    # cast from character
+    } else if (call_binding("is.character", x)) {
+      # this could be improved with tryFormats once strptime returns NA and we
+      # can use coalesce - https://issues.apache.org/jira/browse/ARROW-15659
+      # TODO revisit once https://issues.apache.org/jira/browse/ARROW-15659 is done
+      if (is.null(format)) {
+        if (length(tryFormats) == 1) {
+          format <- tryFormats[1]
+        } else {
+          abort("`as.Date()` with multiple `tryFormats` is not supported in Arrow yet")
+        }
+      }
+      # if x is not an expression (e.g. passed as filter), convert it to one
+      if (!inherits(x, "Expression")) {
+        x <- build_expr("cast", x, options = cast_options(to_type = type(x)))
+      }
+      interim_x <- call_binding("strptime", x, format, unit = "s")
+
+    # cast from numeric
+    } else if (call_binding("is.numeric", x)) {
+      # the origin argument will be better supported once we implement temporal
+      # arithmetic (https://issues.apache.org/jira/browse/ARROW-14947)
+      # TODO revisit once the above has been sorted
+      if (!call_binding("is.integer", x)) {
+        # Arrow does not support direct casting from double to date so we have
+        # to convert to integers first - casting to int32() would error so we
+        # need to use round before casting
+        # TODO revisit if arrow decides to support double -> date casting
+        x <- call_binding("floor", x)
+        interim_x <- build_expr("cast", x, options = (cast_options(to_type = int32())))
+      } else {
+        interim_x <- x
+      }

Review comment:
       Is it possible to overwrite `x` instead of using a separate object? That would let us get rid of this `else` clause, at least. Which might make this set of nested if/elses a bit easier to parse

##########
File path: r/R/dplyr-funcs-type.R
##########
@@ -76,6 +76,64 @@ register_bindings_type_cast <- function() {
   register_binding("as.numeric", function(x) {
     Expression$create("cast", x, options = cast_options(to_type = float64()))
   })
+  register_binding("as.Date", function(x,
+                                       format = NULL,
+                                       tryFormats = "%Y-%m-%d",
+                                       origin = "1970-01-01",
+                                       tz = "UTC") {
+
+    if (call_binding("is.Date", x)) {
+      # base::as.Date() first converts to the desired timezone and then extracts
+      # the date, which is why we need to go through timestamp() first
+      return(x)
+
+    # cast from POSIXct
+    } else if (call_binding("is.POSIXct", x)) {
+      if (tz == "UTC") {
+        interim_x <- build_expr("cast", x, options = cast_options(to_type = timestamp(timezone = tz)))
+      } else {
+        abort("`as.Date()` with a timezone different to 'UTC' is not supported in Arrow")
+      }
+
+    # cast from character
+    } else if (call_binding("is.character", x)) {
+      # this could be improved with tryFormats once strptime returns NA and we
+      # can use coalesce - https://issues.apache.org/jira/browse/ARROW-15659
+      # TODO revisit once https://issues.apache.org/jira/browse/ARROW-15659 is done
+      if (is.null(format)) {
+        if (length(tryFormats) == 1) {
+          format <- tryFormats[1]
+        } else {
+          abort("`as.Date()` with multiple `tryFormats` is not supported in Arrow yet")
+        }
+      }
+      # if x is not an expression (e.g. passed as filter), convert it to one
+      if (!inherits(x, "Expression")) {
+        x <- build_expr("cast", x, options = cast_options(to_type = type(x)))
+      }

Review comment:
       Could you describe a little bit more about what cases this is catching? 
   
   I ask because I suspect that this is catching with the message we are testing:
   
   ```
   expect_error(
       test_df %>%
         arrow_table() %>%
         mutate(date_char = date(character_ymd_var)) %>%
         collect(),
       regexp = "Unsupported cast from string to date32 using function cast_date32"
     )
   ```
   
   I was expecting something like this: 
   
   ```
   > tab <- arrow_table(data.frame(time = "2020-01-01 00:01:00", date = "2020-01-01"))
   > tab %>% mutate(strptime(time, format = "%Y-%m-%d")) %>% collect()
   Error in `handle_csv_read_error()`:
   ! Invalid: Failed to parse string: '2020-01-01 00:01:00' as a scalar of type timestamp[ms]
   Run `rlang::last_error()` to see where the error occurred.
   ```
   
   So I suspect that this `if()` is catching this before that, which I'm a little surprised by. So anyway, what cases are we expecting here?

##########
File path: r/tests/testthat/test-dplyr-funcs-datetime.R
##########
@@ -801,3 +801,94 @@ test_that("dst extracts daylight savings time correctly", {
     test_df
   )
 })
+
+test_that("date works in arrow", {
+  # https://issues.apache.org/jira/browse/ARROW-13168
+  skip_on_os("windows")
+  # this date is specific since lubridate::date() is different from base::as.Date()
+  # since as.Date returns the UTC date and date() doesn't
+  test_df <- tibble(
+    a = as.POSIXct(c("2012-03-26 23:12:13", NA), tz = "America/New_York"))
+  r_date_object <- lubridate::ymd_hms("2012-03-26 23:12:13")
+
+  # we can't (for now) use namespacing, so we need to make sure lubridate::date()
+  # and not base::date() is being used
+  # TODO remove once https://issues.apache.org/jira/browse/ARROW-14575 is done
+  date <- lubridate::date
+  compare_dplyr_binding(
+    .input %>%
+      mutate(a_date = date(a)) %>%
+      collect(),
+    test_df
+  )
+
+  compare_dplyr_binding(
+    .input %>%
+      mutate(a_date_base = as.Date(a)) %>%
+      collect(),
+    test_df
+  )
+
+  compare_dplyr_binding(
+    .input %>%
+      mutate(b = date(r_date_object)) %>%
+      collect(),
+    test_df
+  )
+
+  compare_dplyr_binding(
+    .input %>%
+      mutate(b_base = as.Date(r_date_object)) %>%
+      collect(),
+    test_df
+  )
+})
+
+test_that("date() errors with unsupported inputs", {
+  test_df <- tibble::tibble(
+    posixct_var = as.POSIXct("2022-02-25 00:00:01", tz = "Europe/London"),
+    date_var = as.Date("2022-02-25"),
+    character_ymd_var = "2022-02-25 00:00:01",
+    character_ydm_var = "2022/25/02 00:00:01",
+    integer_var = 32L,
+    double_var = 34.56,
+    boolean_var = TRUE
+  )
+
+  # date from integer supported in arrow (similar to base::as.Date()), but in
+  # Arrow it assumes a fixed origin "1970-01-01"
+  expect_equal(
+    test_df %>%
+      arrow_table() %>%
+      select(integer_var) %>%
+      mutate(date_int = date(integer_var)) %>%
+      collect(),
+    tibble(integer_var = 32L,
+           date_int = as.Date("1970-02-02"))
+  )

Review comment:
       This is minor, but should this be in this block? It seems a bit odd to have a block labeled "date() errors with unsupported inputs" that has an `expect_equal()` without any error in it

##########
File path: r/tests/testthat/test-dplyr-funcs-datetime.R
##########
@@ -801,3 +801,94 @@ test_that("dst extracts daylight savings time correctly", {
     test_df
   )
 })
+
+test_that("date works in arrow", {
+  # https://issues.apache.org/jira/browse/ARROW-13168
+  skip_on_os("windows")
+  # this date is specific since lubridate::date() is different from base::as.Date()
+  # since as.Date returns the UTC date and date() doesn't
+  test_df <- tibble(
+    a = as.POSIXct(c("2012-03-26 23:12:13", NA), tz = "America/New_York"))

Review comment:
       I know we don't have to differentiate things here, but would it be possible to name this something more descriptive than `a` regardless? 

##########
File path: r/R/dplyr-funcs-type.R
##########
@@ -76,6 +76,64 @@ register_bindings_type_cast <- function() {
   register_binding("as.numeric", function(x) {
     Expression$create("cast", x, options = cast_options(to_type = float64()))
   })
+  register_binding("as.Date", function(x,
+                                       format = NULL,
+                                       tryFormats = "%Y-%m-%d",
+                                       origin = "1970-01-01",
+                                       tz = "UTC") {
+
+    if (call_binding("is.Date", x)) {
+      # base::as.Date() first converts to the desired timezone and then extracts
+      # the date, which is why we need to go through timestamp() first
+      return(x)
+
+    # cast from POSIXct
+    } else if (call_binding("is.POSIXct", x)) {
+      if (tz == "UTC") {
+        interim_x <- build_expr("cast", x, options = cast_options(to_type = timestamp(timezone = tz)))
+      } else {
+        abort("`as.Date()` with a timezone different to 'UTC' is not supported in Arrow")
+      }
+
+    # cast from character
+    } else if (call_binding("is.character", x)) {
+      # this could be improved with tryFormats once strptime returns NA and we
+      # can use coalesce - https://issues.apache.org/jira/browse/ARROW-15659
+      # TODO revisit once https://issues.apache.org/jira/browse/ARROW-15659 is done
+      if (is.null(format)) {
+        if (length(tryFormats) == 1) {
+          format <- tryFormats[1]
+        } else {
+          abort("`as.Date()` with multiple `tryFormats` is not supported in Arrow yet")
+        }
+      }
+      # if x is not an expression (e.g. passed as filter), convert it to one
+      if (!inherits(x, "Expression")) {
+        x <- build_expr("cast", x, options = cast_options(to_type = type(x)))
+      }
+      interim_x <- call_binding("strptime", x, format, unit = "s")
+
+    # cast from numeric
+    } else if (call_binding("is.numeric", x)) {
+      # the origin argument will be better supported once we implement temporal
+      # arithmetic (https://issues.apache.org/jira/browse/ARROW-14947)
+      # TODO revisit once the above has been sorted
+      if (!call_binding("is.integer", x)) {
+        # Arrow does not support direct casting from double to date so we have
+        # to convert to integers first - casting to int32() would error so we
+        # need to use round before casting

Review comment:
       Using `floor` here is fine, but you should dig into how|why you get the error you're talking about here — there is a way around it if you wanted to do that!

##########
File path: r/R/dplyr-funcs-type.R
##########
@@ -76,6 +76,64 @@ register_bindings_type_cast <- function() {
   register_binding("as.numeric", function(x) {
     Expression$create("cast", x, options = cast_options(to_type = float64()))
   })
+  register_binding("as.Date", function(x,
+                                       format = NULL,
+                                       tryFormats = "%Y-%m-%d",
+                                       origin = "1970-01-01",
+                                       tz = "UTC") {
+
+    if (call_binding("is.Date", x)) {
+      # base::as.Date() first converts to the desired timezone and then extracts
+      # the date, which is why we need to go through timestamp() first
+      return(x)
+
+    # cast from POSIXct
+    } else if (call_binding("is.POSIXct", x)) {
+      if (tz == "UTC") {
+        interim_x <- build_expr("cast", x, options = cast_options(to_type = timestamp(timezone = tz)))
+      } else {
+        abort("`as.Date()` with a timezone different to 'UTC' is not supported in Arrow")
+      }
+
+    # cast from character
+    } else if (call_binding("is.character", x)) {
+      # this could be improved with tryFormats once strptime returns NA and we
+      # can use coalesce - https://issues.apache.org/jira/browse/ARROW-15659
+      # TODO revisit once https://issues.apache.org/jira/browse/ARROW-15659 is done
+      if (is.null(format)) {
+        if (length(tryFormats) == 1) {
+          format <- tryFormats[1]
+        } else {
+          abort("`as.Date()` with multiple `tryFormats` is not supported in Arrow yet")
+        }
+      }
+      # if x is not an expression (e.g. passed as filter), convert it to one
+      if (!inherits(x, "Expression")) {
+        x <- build_expr("cast", x, options = cast_options(to_type = type(x)))
+      }

Review comment:
       And separately — we should make a Jira if one doesn't already exist to see if we can allow `strptime(time, format = "%Y-%m-%d")` to work (which does in R)
   
   ```
   > strptime("2020-01-01 00:01:00", "%Y-%m-%d")
   [1] "2020-01-01 CST"
   ```




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org