You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/04/29 13:18:10 UTC

[GitHub] [arrow] jonkeane commented on a diff in pull request #12589: ARROW-14848: [R] Implement bindings for lubridate's parse_date_time, parse_date_time2, and fast_strptime

jonkeane commented on code in PR #12589:
URL: https://github.com/apache/arrow/pull/12589#discussion_r861766168


##########
r/tests/testthat/test-dplyr-funcs-datetime.R:
##########
@@ -1471,3 +1471,128 @@ test_that("make_difftime()", {
     )
   )
 })
+
+test_that("parse_date_time()", {
+  test_df <- tibble(
+    dates = c("09-01-17", "02-Sep-17")
+  )
+
+  test_dates <- tibble::tibble(
+    string_ymd = c(
+      "2021-09-1", "2021/09///2", "2021.09.03", "2021,09,4", "2021:09::5",
+      "2021 09   6", "21-09-07", "21/09/08", "21.09.9", "21,09,10", "21:09:11",
+      # not yet working for strings with no separators, like "20210917", "210918" or "2021Sep19
+      # no separators and %b or %B are even more complicated (and they work in
+      # lubridate). not to mention locale
+      NA
+    ),
+    string_dmy = c(
+      "1-09-2021", "2/09//2021", "03.09.2021", "04,09,2021", "5:::09:2021",
+      "6  09  2021", "07-09-21", "08/09/21", "9.09.21", "10,09,21", "11:09:21",
+      # not yet working for strings with no separators, like "10092021", "100921",
+      NA
+    ),
+    string_mdy = c(
+      "09-01-2021", "09/2/2021", "09.3.2021", "09,04,2021", "09:05:2021",
+      "09 6 2021", "09-7-21", "09/08/21", "09.9.21", "09,10,21", "09:11:21",
+      # not yet working for strings with no separators, like "09102021", "091021",
+      NA
+    )
+  )
+
+  test_dates_locale <- tibble::tibble(
+    string_ymd = c(
+      "2021 Sep 12", "2021 September 13", "21 Sep 14", "21 September 15", NA
+    ),
+    string_dmy = c(
+      "12 Sep 2021", "13 September 2021", "14 Sep 21", "15 September 21", NA
+    ),
+    string_mdy = c(
+      "Sep 12 2021", "September 13 2021", "Sep 14 21", "September 15 21", NA
+    )
+  )
+
+  skip_if_not_available("re2")
+  compare_dplyr_binding(
+    .input %>%
+      mutate(
+        parsed_date_ymd = parse_date_time(string_ymd, orders = "ymd"),
+        parsed_date_dmy = parse_date_time(string_dmy, orders = "dmy"),
+        parsed_date_mdy = parse_date_time(string_mdy, orders = "mdy")
+      ) %>%
+      collect(),
+    test_dates
+  )
+
+  # locale not working properly on windows
+  # TODO revisit once https://issues.apache.org/jira/browse/ARROW-16399 is done
+  skip_on_os("windows")

Review Comment:
   If you ensure that the locale is set to C here, do these tests pass?



##########
r/R/dplyr-funcs-datetime.R:
##########
@@ -357,6 +357,62 @@ register_bindings_duration <- function() {
     delta <- delta$cast(int64())
     start + delta$cast(duration("s"))
   })
+  register_binding("fast_strptime", function(x,
+                                             format,
+                                             tz = "UTC",
+                                             lt = TRUE,
+                                             cutoff_2000 = 68L,
+                                             unit = "s") {
+    # TODO support multiple formats once
+    # https://issues.apache.org/jira/browse/ARROW-15665 is done

Review Comment:
   Do you plan to do this in this PR? 



##########
r/R/dplyr-funcs-datetime.R:
##########
@@ -357,6 +357,62 @@ register_bindings_duration <- function() {
     delta <- delta$cast(int64())
     start + delta$cast(duration("s"))
   })
+  register_binding("fast_strptime", function(x,
+                                             format,
+                                             tz = "UTC",
+                                             lt = TRUE,
+                                             cutoff_2000 = 68L,
+                                             unit = "s") {
+    # TODO support multiple formats once
+    # https://issues.apache.org/jira/browse/ARROW-15665 is done
+    if (length(format) > 1) {
+      arrow_not_supported("multiple values for `format`")
+    }
+
+    if (!missing(tz)) {
+      arrow_not_supported("Time zone argument")
+    }
+    # `lt` controls the output `lt = TRUE` returns a POSIXlt (which doesn't play
+    # well with mutate, for example)
+    if (lt) {
+      arrow_not_supported("`lt = TRUE` argument")
+    }
+
+    if (cutoff_2000 != 68L) {
+      arrow_not_supported("`cutoff_2000` != 68L argument")
+    }
+
+    unit <- make_valid_time_unit(unit, c(valid_time64_units, valid_time32_units))
+
+    build_expr("strptime", x, options = list(format = format, unit = unit))

Review Comment:
   If you used `call_binding` here, you could defer the `tz` + `unit` checking to that binding. 



##########
r/R/dplyr-funcs-datetime.R:
##########
@@ -539,3 +595,21 @@ binding_as_date_numeric <- function(x, origin = "1970-01-01") {
 
   x
 }
+
+build_formats <- function(orders) {
+  year_chars <- sprintf("%%%s", c("y", "Y"))
+  month_chars <- sprintf("%%%s", c("m", "B", "b"))
+  day_chars <- sprintf("%%%s", "d")

Review Comment:
   These each expand into `c("%y", "%Y")`, `c("%m", "%B", "%b")`, and `"%d"`, yeah? I know this is _slightly_ more DRY to do it with `sprintf()`, but it would be significantly shorter + less needing to think about it if we just wrote them out, no?



##########
r/R/dplyr-funcs-datetime.R:
##########
@@ -357,6 +357,62 @@ register_bindings_duration <- function() {
     delta <- delta$cast(int64())
     start + delta$cast(duration("s"))
   })
+  register_binding("fast_strptime", function(x,
+                                             format,
+                                             tz = "UTC",
+                                             lt = TRUE,
+                                             cutoff_2000 = 68L,
+                                             unit = "s") {
+    # TODO support multiple formats once
+    # https://issues.apache.org/jira/browse/ARROW-15665 is done
+    if (length(format) > 1) {
+      arrow_not_supported("multiple values for `format`")
+    }
+
+    if (!missing(tz)) {
+      arrow_not_supported("Time zone argument")
+    }

Review Comment:
   Is there a jira for the thing that is preventing us from implementing this? I know it's not (yet) implemented in the strptime binding either, but I don't see any references to issues that are preventing that?



##########
r/R/dplyr-funcs-datetime.R:
##########
@@ -357,6 +357,62 @@ register_bindings_duration <- function() {
     delta <- delta$cast(int64())
     start + delta$cast(duration("s"))
   })
+  register_binding("fast_strptime", function(x,
+                                             format,
+                                             tz = "UTC",
+                                             lt = TRUE,
+                                             cutoff_2000 = 68L,
+                                             unit = "s") {
+    # TODO support multiple formats once
+    # https://issues.apache.org/jira/browse/ARROW-15665 is done
+    if (length(format) > 1) {
+      arrow_not_supported("multiple values for `format`")
+    }
+
+    if (!missing(tz)) {
+      arrow_not_supported("Time zone argument")
+    }
+    # `lt` controls the output `lt = TRUE` returns a POSIXlt (which doesn't play
+    # well with mutate, for example)
+    if (lt) {
+      arrow_not_supported("`lt = TRUE` argument")
+    }
+
+    if (cutoff_2000 != 68L) {
+      arrow_not_supported("`cutoff_2000` != 68L argument")
+    }
+
+    unit <- make_valid_time_unit(unit, c(valid_time64_units, valid_time32_units))
+
+    build_expr("strptime", x, options = list(format = format, unit = unit))
+  })
+  register_binding("parse_date_time", function(x,
+                                               orders,
+                                               tz = "UTC") {
+
+    # make all separators (non-letters and non-numbers) into "-"
+    x <- call_binding("gsub", "[^A-Za-z0-9]", "-", x)
+    # collapse multiple separators into a single one
+    x <- call_binding("gsub", "-{2,}", "-", x)
+
+    # TODO figure out how to parse strings that have no separators)
+    # we could insert separators at the "likely" positions, but it might be
+    # tricky given the possible combinations between dmy formats + locale
+
+    # each order is translated into 6 possible formats
+    formats <- build_formats(orders)
+    coalesce_output <- build_expr(
+      "coalesce",
+      build_expr("strptime", x, options = list(format = formats[1], unit = 0L, error_is_null = TRUE)),
+      build_expr("strptime", x, options = list(format = formats[2], unit = 0L, error_is_null = TRUE)),
+      build_expr("strptime", x, options = list(format = formats[3], unit = 0L, error_is_null = TRUE)),
+      build_expr("strptime", x, options = list(format = formats[4], unit = 0L, error_is_null = TRUE)),
+      build_expr("strptime", x, options = list(format = formats[5], unit = 0L, error_is_null = TRUE)),
+      build_expr("strptime", x, options = list(format = formats[6], unit = 0L, error_is_null = TRUE))
+    )
+
+    build_expr("assume_timezone", coalesce_output, options = list(timezone = tz))

Review Comment:
   We won't support adding time components to these yet, will we? We should error so that folks know what's going on if they give a full time string. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org