Posted to jira@arrow.apache.org by "Dewey Dunnington (Jira)" <ji...@apache.org> on 2021/11/08 19:48:00 UTC

[jira] [Commented] (ARROW-14471) [R] Implement lubridate's date/time parsing functions

    [ https://issues.apache.org/jira/browse/ARROW-14471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17440722#comment-17440722 ] 

Dewey Dunnington commented on ARROW-14471:
------------------------------------------

I did a bit of looking into this: lubridate uses a [custom C parser for its order-based datetime parsers|https://github.com/tidyverse/lubridate/blob/main/src/tparse.c#L46-L391]. That said, its functionality can be approximated by {{coalesce(strptime(dt_string, "format1"), strptime(dt_string, "format2"), ...)}}. Is it worth translating the functions with an approximation that handles most of the use cases?
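To make the approximation concrete, here is a minimal sketch in plain R (not Arrow) of the coalesce-over-formats idea. The helper name {{approx_ymd}} and its list of candidate formats are made up for illustration and are not part of lubridate or arrow; an actual translation would need the equivalent Arrow compute calls and a more careful choice of formats.

```r
library(dplyr, warn.conflicts = FALSE)

# Hypothetical helper (an assumption for illustration only): try each
# candidate format in turn and keep the first successful parse per element.
approx_ymd <- function(x,
                       formats = c("%Y-%m-%d", "%Y/%m/%d", "%Y%m%d", "%Y %B %d")) {
  # One parse attempt per format; elements that do not match become NA
  parsed <- lapply(formats, function(fmt) {
    as.Date(strptime(x, format = fmt, tz = "UTC"))
  })
  # coalesce() takes, for each element, the first non-NA value across attempts
  do.call(coalesce, parsed)
}

# Numeric inputs only: month names ("%B") depend on the current locale
approx_ymd(c("2021-09-10", "2021/09/10", "20210910", NA))
```

Unlike lubridate's parser, this only recognizes the formats that are listed explicitly, and the order of the formats decides which interpretation wins when a string matches more than one.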

Some testing that might be useful when putting together a PR:
{code:r}
library(arrow, warn.conflicts = FALSE)
library(dplyr, warn.conflicts = FALSE)

test_dates <- tibble::tibble(
  string_ymd = c("2021-09-10", "2021/09/10", "20210910", "2021 Sep 10", "2021 September 10", NA),
  string_dmy = c("10-09-2021", "10/09/2021", "10092021", "10 Sep 2021", "10 September 2021", NA),
  string_mdy = c("09-10-2021", "09/10/2021", "09102021", "Sep 10 2021", "September 10 2021", NA),
  date = c(rep(as.Date("2021-09-10"), 5), NA),
  date_midnight = c(rep(as.POSIXct("2021-09-10 00:00:00", tz = "UTC"), 5), NA)
)

# the tzone attribute seems to get dropped (by as.POSIXct if the system tz is UTC?),
# so set it explicitly
attr(test_dates$date_midnight, "tzone") <- "UTC"

test_datetimes <- tibble::tibble(
  # note the leading space, which separates the date part from the time part
  string_ymd_hms = stringr::str_c(test_dates$string_ymd, " 01:23:45"),
  string_dmy_hms = stringr::str_c(test_dates$string_dmy, " 01:23:45"),
  string_mdy_hms = stringr::str_c(test_dates$string_mdy, " 01:23:45"),
  string_ymd_hm = stringr::str_c(test_dates$string_ymd, " 01:23"),
  string_dmy_hm = stringr::str_c(test_dates$string_dmy, " 01:23"),
  string_mdy_hm = stringr::str_c(test_dates$string_mdy, " 01:23"),
  string_ymd_h = stringr::str_c(test_dates$string_ymd, " 01"),
  string_dmy_h = stringr::str_c(test_dates$string_dmy, " 01"),
  string_mdy_h = stringr::str_c(test_dates$string_mdy, " 01"),
  date_second = c(rep(as.POSIXct("2021-09-10 01:23:45", tz = "UTC"), 5), NA),
  date_minute = c(rep(as.POSIXct("2021-09-10 01:23", tz = "UTC"), 5), NA),
  date_hour = c(rep(as.POSIXct("2021-09-10", tz = "UTC") + 60 * 60, 5), NA)
)

# the tzone attribute seems to get dropped (by as.POSIXct if the system tz is UTC?),
# so set it explicitly
attr(test_datetimes$date_second, "tzone") <- "UTC"
attr(test_datetimes$date_minute, "tzone") <- "UTC"
attr(test_datetimes$date_hour, "tzone") <- "UTC"

# tests with lubridate, R eval
library(testthat, warn.conflicts = FALSE)
library(lubridate, warn.conflicts = FALSE)

expect_identical(ymd(test_dates$string_ymd), test_dates$date)
expect_identical(dmy(test_dates$string_dmy), test_dates$date)
expect_identical(mdy(test_dates$string_mdy), test_dates$date)

expect_identical(ymd(test_dates$string_ymd, tz = "UTC"), test_dates$date_midnight)
expect_identical(dmy(test_dates$string_dmy, tz = "UTC"), test_dates$date_midnight)
expect_identical(mdy(test_dates$string_mdy, tz = "UTC"), test_dates$date_midnight)

expect_identical(
  ymd_hms(test_datetimes$string_ymd_hms, tz = "UTC"),
  test_datetimes$date_second
)
expect_identical(
  dmy_hms(test_datetimes$string_dmy_hms, tz = "UTC"),
  test_datetimes$date_second
)
expect_identical(
  mdy_hms(test_datetimes$string_mdy_hms, tz = "UTC"),
  test_datetimes$date_second
)

expect_identical(
  ymd_hm(test_datetimes$string_ymd_hm, tz = "UTC"),
  test_datetimes$date_minute
)
expect_identical(
  dmy_hm(test_datetimes$string_dmy_hm, tz = "UTC"),
  test_datetimes$date_minute
)
expect_identical(
  mdy_hm(test_datetimes$string_mdy_hm, tz = "UTC"),
  test_datetimes$date_minute
)

expect_identical(
  ymd_h(test_datetimes$string_ymd_h, tz = "UTC"),
  test_datetimes$date_hour
)
expect_identical(
  dmy_h(test_datetimes$string_dmy_h, tz = "UTC"),
  test_datetimes$date_hour
)
expect_identical(
  mdy_h(test_datetimes$string_mdy_h, tz = "UTC"),
  test_datetimes$date_hour
)
{code}

> [R] Implement lubridate's date/time parsing functions
> -----------------------------------------------------
>
>                 Key: ARROW-14471
>                 URL: https://issues.apache.org/jira/browse/ARROW-14471
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: R
>            Reporter: Nicola Crane
>            Assignee: Dewey Dunnington
>            Priority: Major
>             Fix For: 7.0.0
>
>
> Parse dates with year, month, and day components:
> ymd() ydm() mdy() myd() dmy() dym() yq() ym() my()
>
> Parse date-times with year, month, day, hour, minute, and second components:
> ymd_hms() ymd_hm() ymd_h() dmy_hms() dmy_hm() dmy_h() mdy_hms() mdy_hm() mdy_h() ydm_hms() ydm_hm() ydm_h()
>
> Parse periods with hour, minute, and second components:
> ms() hm() hms()
>


