You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/04/18 14:36:18 UTC

[GitHub] [arrow] jonkeane commented on a diff in pull request #12738: ARROW-15800 [R] Implement bindings for `lubridate::as_date()` and `lubridate::as_datetime()`

jonkeane commented on code in PR #12738:
URL: https://github.com/apache/arrow/pull/12738#discussion_r852157664


##########
r/R/dplyr-funcs-datetime.R:
##########
@@ -372,3 +372,50 @@ binding_format_datetime <- function(x, format = "", tz = "", usetz = FALSE) {
 
   build_expr("strftime", x, options = list(format = format, locale = Sys.getlocale("LC_TIME")))
 }
+
+binding_as_date <- function(x,
+                            format = NULL,
+                            tryFormats = "%Y-%m-%d",
+                            origin = "1970-01-01",
+                            tz = "UTC",
+                            base = TRUE) {
+
+  if (is.null(format) && length(tryFormats) > 1) {
+    abort("`as.Date()` with multiple `tryFormats` is not supported in Arrow")
+  }
+
+  if (call_binding("is.Date", x)) {
+    return(x)
+
+    # cast from POSIXct
+  } else if (call_binding("is.POSIXct", x)) {
+    # base::as.Date() first converts to the desired timezone and then extracts
+    # the date, which is why we need to go through timestamp() first
+    if (base || !is.null(tz)) {
+      x <- build_expr("cast", x, options = cast_options(to_type = timestamp(timezone = tz)))
+    }

Review Comment:
   Would it be possible to put this exceptional case in the binding for `as.Date` before then calling `binding_as_date`? I'm not 100% sure it would be cleaner, necessarily, but it would make this helper function much cleaner, and keep the exceptional part of the binding with the binding declaration bit.



##########
r/tests/testthat/test-dplyr-funcs-type.R:
##########
@@ -854,21 +874,94 @@ test_that("as.Date() converts successfully from date, timestamp, integer, char a
     fixed = TRUE
   )
 
-  # we do not support as.Date() with double/ float
-  compare_dplyr_binding(
-     .input %>%
+
+  # we do not support as.Date() with double/ float (error surfaced from C++)
+  # TODO revisit after https://issues.apache.org/jira/browse/ARROW-15798
+  expect_error(
+    test_df %>%
+      arrow_table() %>%
       mutate(date_double = as.Date(double_var, origin = "1970-01-01")) %>%
+      collect()
+  )
+
+  # we do not support as_date with double/ float (error surfaced from C++)
+  # TODO revisit after https://issues.apache.org/jira/browse/ARROW-15798

Review Comment:
   ```suggestion
     # TODO: revisit after https://issues.apache.org/jira/browse/ARROW-15798
   ```



##########
r/tests/testthat/test-dplyr-funcs-type.R:
##########
@@ -854,21 +874,94 @@ test_that("as.Date() converts successfully from date, timestamp, integer, char a
     fixed = TRUE
   )
 
-  # we do not support as.Date() with double/ float
-  compare_dplyr_binding(
-     .input %>%
+
+  # we do not support as.Date() with double/ float (error surfaced from C++)
+  # TODO revisit after https://issues.apache.org/jira/browse/ARROW-15798
+  expect_error(
+    test_df %>%
+      arrow_table() %>%
       mutate(date_double = as.Date(double_var, origin = "1970-01-01")) %>%
+      collect()
+  )
+
+  # we do not support as_date with double/ float (error surfaced from C++)
+  # TODO revisit after https://issues.apache.org/jira/browse/ARROW-15798
+  expect_error(
+    test_df %>%
+      arrow_table() %>%
+      mutate(date_double = as_date(double_var, origin = "1970-01-01")) %>%
+      collect()
+  )
+
+  # difference between as.Date() and as_date():
+  #`as.Date()` ignores the `tzone` attribute and uses the value of the `tz` arg
+  # to `as.Date()`
+  # `as_date()` does the opposite: uses the tzone attribute of the POSIXct object
+  # passsed if`tz` is NULL
+  compare_dplyr_binding(
+    .input %>%
+      transmute(
+        date_diff_lubridate = as_date(difference_date),
+        date_diff_base = as.Date(difference_date)
+      ) %>%
+      collect(),
+    test_df
+  )
+})
+
+test_that("`as_datetime()`", {
+  test_df <- tibble(
+    date = as.Date(c("2022-03-22", "2021-07-30", NA)),
+    char_date = c("2022-03-22", "2021-07-30 14:32:47", NA),
+    char_date_non_iso = c("2022-22-03 12:34:56", "2021-30-07 14:32:47", NA),
+    int_date = c(10L, 25L, NA),
+    integerish_date = c(10, 25, NA),
+    double_date = c(10.1, 25.2, NA)
+  )
+
+  test_df %>%
+    arrow_table() %>%
+    mutate(
+      ddate = as_datetime(date),
+      dchar_date_no_tz = as_datetime(char_date),
+      dchar_date_non_iso = as_datetime(char_date_non_iso, format = "%Y-%d-%m %H:%M:%S"),
+      dint_date = as_datetime(int_date, origin = "1970-01-02"),
+      dintegerish_date = as_datetime(integerish_date, origin = "1970-01-02"),
+      dintegerish_date2 = as_datetime(integerish_date, origin = "1970-01-01")
+    ) %>%
+    collect()
+
+  compare_dplyr_binding(
+    .input %>%
+      mutate(
+        ddate = as_datetime(date),
+        dchar_date_no_tz = as_datetime(char_date),
+        dint_date = as_datetime(int_date, origin = "1970-01-02"),
+        dintegerish_date = as_datetime(integerish_date, origin = "1970-01-02"),
+        dintegerish_date2 = as_datetime(integerish_date, origin = "1970-01-01")
+      ) %>%
+      collect(),
+    test_df
+  )
+
+  # Arrow does not support conversion of double to date
+  # the below should error with an error message originating in the C++ code
+  expect_error(
+    test_df %>%
+      arrow_table() %>%
+      mutate(
+        ddouble_date = as_datetime(double_date)
+      ) %>%
       collect(),
-     test_df,
-     warning = TRUE
+    regexp = "Float value 10.1 was truncated converting to int64"
   )
 
+  # separate tz test so we can skip on Windows
   skip_on_os("windows") # https://issues.apache.org/jira/browse/ARROW-13168

Review Comment:
   We shouldn't need to skip on windows anymore, right?



##########
r/R/dplyr-funcs-datetime.R:
##########
@@ -372,3 +372,50 @@ binding_format_datetime <- function(x, format = "", tz = "", usetz = FALSE) {
 
   build_expr("strftime", x, options = list(format = format, locale = Sys.getlocale("LC_TIME")))
 }
+
+binding_as_date <- function(x,
+                            format = NULL,
+                            tryFormats = "%Y-%m-%d",
+                            origin = "1970-01-01",
+                            tz = "UTC",
+                            base = TRUE) {
+
+  if (is.null(format) && length(tryFormats) > 1) {
+    abort("`as.Date()` with multiple `tryFormats` is not supported in Arrow")
+  }
+
+  if (call_binding("is.Date", x)) {
+    return(x)
+
+    # cast from POSIXct
+  } else if (call_binding("is.POSIXct", x)) {
+    # base::as.Date() first converts to the desired timezone and then extracts
+    # the date, which is why we need to go through timestamp() first
+    if (base || !is.null(tz)) {
+      x <- build_expr("cast", x, options = cast_options(to_type = timestamp(timezone = tz)))
+    }
+    # POSIXct is of type double -> we need this to prevent going down the
+    # "double" branch
+    x <- x
+
+    # cast from character
+  } else if (call_binding("is.character", x)) {
+    format <- format %||% tryFormats[[1]]
+    # unit = 0L is the identifier for seconds in valid_time32_units
+    x <- build_expr("strptime", x, options = list(format = format, unit = 0L))
+
+    # cast from numeric
+  } else if (call_binding("is.numeric", x) &
+             (!call_binding("is.integer", x) | origin != "1970-01-01")) {
+    # Arrow does not support direct casting from double to date32(), but for
+    # integer-like values we can go via int32()
+    # https://issues.apache.org/jira/browse/ARROW-15798
+    # TODO revisit if arrow decides to support double -> date casting
+    x <- build_expr("cast", x, options = cast_options(to_type = int32()))
+    delta_in_sec <- call_binding("difftime", origin, "1970-01-01")
+    delta_in_sec <- build_expr("cast", delta_in_sec, options = cast_options(to_type = int64()))
+    delta_in_days <- (delta_in_sec / 86400L)$cast(int32())
+    x <- build_expr("+", x, delta_in_days)

Review Comment:
   I'm a little bit confused about this section. I suspect that we might be conflating a few things together which is making this hard(er) to read than it needs to be:
   
   * For the condition, we trigger this block if `x` is numeric and either (or both) x is not an integer or the origin is non-default. And that means we are conflating our float work around with allowing a non-default origin. I bet if we separated these two out, that would make this clearer + easier to maintain. Feel free to break this out into a separate smaller help function if that's beneficial.
   * The casting on `delta_in_sec` seems off — is there a reason you want|need to case that to int64, and then later divide by something that can fit in int32, then bast it to int32? Could you start off as int32 on line 416?
    



##########
r/R/dplyr-funcs-datetime.R:
##########
@@ -372,3 +372,50 @@ binding_format_datetime <- function(x, format = "", tz = "", usetz = FALSE) {
 
   build_expr("strftime", x, options = list(format = format, locale = Sys.getlocale("LC_TIME")))
 }
+
+binding_as_date <- function(x,
+                            format = NULL,
+                            tryFormats = "%Y-%m-%d",
+                            origin = "1970-01-01",
+                            tz = "UTC",
+                            base = TRUE) {
+
+  if (is.null(format) && length(tryFormats) > 1) {
+    abort("`as.Date()` with multiple `tryFormats` is not supported in Arrow")
+  }
+
+  if (call_binding("is.Date", x)) {
+    return(x)
+
+    # cast from POSIXct
+  } else if (call_binding("is.POSIXct", x)) {
+    # base::as.Date() first converts to the desired timezone and then extracts
+    # the date, which is why we need to go through timestamp() first
+    if (base || !is.null(tz)) {
+      x <- build_expr("cast", x, options = cast_options(to_type = timestamp(timezone = tz)))
+    }
+    # POSIXct is of type double -> we need this to prevent going down the
+    # "double" branch
+    x <- x

Review Comment:
   I'm not sure I follow what happens if we don't have this self assignment? This shouldn't be needed (and honestly shouldn't have any effect)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org