You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/02/07 13:58:55 UTC

[GitHub] [arrow] dragosmg opened a new pull request #12353: ARROW-14471 [R] Implement lubridate's date/time parsing functions

dragosmg opened a new pull request #12353:
URL: https://github.com/apache/arrow/pull/12353


   Once this PR is merged arrow queries would be able to parse dates, date times, and periods with hour, minute, and second components correctly. For example,
   ```r
   df <- tibble(x = c("09-01-01", "09-01-02", "09-01-03")) 
   df %>% 
     record_batch() %>% 
     mutate(y = ymd(x)) %>% 
     collect()
   ```
   would be equivalent of 
   ```r
   df %>%
      mutate(y = lubridate::ymd(x)
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] dragosmg commented on a change in pull request #12353: ARROW-14471: [R] Implement lubridate's date/time parsing functions

Posted by GitBox <gi...@apache.org>.
dragosmg commented on a change in pull request #12353:
URL: https://github.com/apache/arrow/pull/12353#discussion_r804685267



##########
File path: r/R/dplyr-funcs-datetime.R
##########
@@ -148,4 +148,56 @@ register_bindings_datetime <- function() {
     !call_binding("am", x)
   })
 
+  register_binding("ymd", function(x) {
+    format_map <-
+      list(
+        ymd_hyphen1 = "%Y-%m-%d",
+        ymd_hyphen2 = "%y-%m-%d",
+        ymd_hyphen3 = "%Y-%B-%d",
+        ymd_hyphen4 = "%y-%B-%d",
+        ymd_hyphen5 = "%Y-%b-%d",
+        ymd_hyphen6 = "%y-%b-%d",

Review comment:
       https://issues.apache.org/jira/browse/ARROW-15659, but this is for the R side only. 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jonkeane commented on a change in pull request #12353: ARROW-14471: [R] Implement lubridate's date/time parsing functions

Posted by GitBox <gi...@apache.org>.
jonkeane commented on a change in pull request #12353:
URL: https://github.com/apache/arrow/pull/12353#discussion_r803893141



##########
File path: r/R/dplyr-funcs-datetime.R
##########
@@ -148,4 +148,56 @@ register_bindings_datetime <- function() {
     !call_binding("am", x)
   })
 
+  register_binding("ymd", function(x) {
+    format_map <-
+      list(
+        ymd_hyphen1 = "%Y-%m-%d",
+        ymd_hyphen2 = "%y-%m-%d",
+        ymd_hyphen3 = "%Y-%B-%d",
+        ymd_hyphen4 = "%y-%B-%d",
+        ymd_hyphen5 = "%Y-%b-%d",
+        ymd_hyphen6 = "%y-%b-%d",

Review comment:
       >  coalesce() needs the first call to call_binding("strptime") to return NA in order to move on to the second one and so on. Instead the first call errors and the second one never get evaluated. tryCatch() doesn't seem to work in this context.
   
   Good point! Is there a jira to make that a possibility for the `strptime` binding (or in the C++ itself, which is the ultimate place it'll really need to be...)?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jonkeane commented on a change in pull request #12353: ARROW-14471: [R] Implement lubridate's date/time parsing functions

Posted by GitBox <gi...@apache.org>.
jonkeane commented on a change in pull request #12353:
URL: https://github.com/apache/arrow/pull/12353#discussion_r802008905



##########
File path: r/tests/testthat/test-dplyr-funcs-datetime.R
##########
@@ -711,3 +711,35 @@ test_that("am/pm mirror lubridate", {
   )
 
 })
+
+test_that("date/time parsing / ymd() and `-` separator", {
+  test_dates <- tibble::tibble(
+    string_ymd = c(
+      "2021-09-10", "2021/09/10", "2021.09.10", "2021,09,10", "2021:09:10",
+      "20210910", "2021 Sep 10", "2021 September 10", "21-09-10", "21/09/10",
+      "21.09.10", "21,09,10", "21:09:10", "210910", "21 Sep 10",
+      "21 September 10", NA
+    ),
+    string_dmy = c(
+      "10-09-2021", "10/09/2021", "10.09.2021", "10,09,2021", "10:09:2021",
+      "10092021", "10 Sep 2021", "10 September 2021", "10-09-21", "10/09/21",
+      "10.09.21", "10,09,21", "10:09:21", "100921", "10 Sep 21",
+      "10 September 21", NA
+    ),
+    string_mdy = c(
+      "09-10-2021", "09/10/2021", "09.10.2021", "09,10,2021", "09:10:2021",
+      "09102021", "Sep 10 2021", "September 10 2021", "09-10-21", "09/10/21",
+      "09.10.21", "09,10,21", "09:10:21", "091021", "Sep 10 21",
+      "September 10 21", NA
+    ),
+    date = c(rep(as.Date("2021-09-10"), 16), NA),
+    date_midnight = c(rep(as.POSIXct("2021-09-10 00:00:00", tz = "UTC"), 16), NA)
+  )
+
+  compare_dplyr_binding(
+    .input %>%
+      mutate(x = ymd(string_ymd)) %>%
+      collect(),
+    test_dates
+  )

Review comment:
       We should also have a check that if the format fails, we get `NA`s out — so long as that's possible)
   
   It looks like right now, it throws an error for the whole conversion: https://github.com/apache/arrow/runs/5094263843?check_suite_focus=true#step:6:22119




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] github-actions[bot] commented on pull request #12353: ARROW-14471 [R] Implement lubridate's date/time parsing functions

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #12353:
URL: https://github.com/apache/arrow/pull/12353#issuecomment-1031497310


   https://issues.apache.org/jira/browse/ARROW-14471


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] dragosmg commented on a change in pull request #12353: ARROW-14471: [R] Implement lubridate's date/time parsing functions

Posted by GitBox <gi...@apache.org>.
dragosmg commented on a change in pull request #12353:
URL: https://github.com/apache/arrow/pull/12353#discussion_r803884693



##########
File path: r/R/dplyr-funcs-datetime.R
##########
@@ -148,4 +148,56 @@ register_bindings_datetime <- function() {
     !call_binding("am", x)
   })
 
+  register_binding("ymd", function(x) {
+    format_map <-
+      list(
+        ymd_hyphen1 = "%Y-%m-%d",
+        ymd_hyphen2 = "%y-%m-%d",
+        ymd_hyphen3 = "%Y-%B-%d",
+        ymd_hyphen4 = "%y-%B-%d",
+        ymd_hyphen5 = "%Y-%b-%d",
+        ymd_hyphen6 = "%y-%b-%d",

Review comment:
       I think I will move the focus on `parse_date_time()` of which most other date/time parsing functions are a particular case of. `ymd(x)` becomes `parse_date_time(x, orders = "ymd")`, but the problems highlighted here still remain.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] dragosmg commented on a change in pull request #12353: ARROW-14471: [R] Implement lubridate's date/time parsing functions

Posted by GitBox <gi...@apache.org>.
dragosmg commented on a change in pull request #12353:
URL: https://github.com/apache/arrow/pull/12353#discussion_r803882802



##########
File path: r/R/dplyr-funcs-datetime.R
##########
@@ -148,4 +148,56 @@ register_bindings_datetime <- function() {
     !call_binding("am", x)
   })
 
+  register_binding("ymd", function(x) {
+    format_map <-
+      list(
+        ymd_hyphen1 = "%Y-%m-%d",
+        ymd_hyphen2 = "%y-%m-%d",
+        ymd_hyphen3 = "%Y-%B-%d",
+        ymd_hyphen4 = "%y-%B-%d",
+        ymd_hyphen5 = "%Y-%b-%d",
+        ymd_hyphen6 = "%y-%b-%d",

Review comment:
       Reducing the formats works (I might have to implement it in several steps to handle `" "` and multiple instances of the same separator). There are a couple of outstanding issues:
   
   * `coalesce()` needs the first call to `call_binding("strptime")` to return `NA` in order to move on to the second one and so on. Instead the first call errors and the second one never get evaluated. `tryCatch()` doesn't seem to work in this context.
   * short (`"%y"`) vs long (`"%Y"`) years. If the string contains a short year (e.g. "19") libarrrow will not error for the first format, but instead return `0019`.  




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jonkeane commented on a change in pull request #12353: ARROW-14471: [R] Implement lubridate's date/time parsing functions

Posted by GitBox <gi...@apache.org>.
jonkeane commented on a change in pull request #12353:
URL: https://github.com/apache/arrow/pull/12353#discussion_r802006466



##########
File path: r/R/dplyr-funcs-datetime.R
##########
@@ -148,4 +148,56 @@ register_bindings_datetime <- function() {
     !call_binding("am", x)
   })
 
+  register_binding("ymd", function(x) {
+    format_map <-
+      list(
+        ymd_hyphen1 = "%Y-%m-%d",
+        ymd_hyphen2 = "%y-%m-%d",
+        ymd_hyphen3 = "%Y-%B-%d",
+        ymd_hyphen4 = "%y-%B-%d",
+        ymd_hyphen5 = "%Y-%b-%d",
+        ymd_hyphen6 = "%y-%b-%d",

Review comment:
       Could we use one set of 6 like this, and then pre-process the strings with a regex like was suggested in https://issues.apache.org/jira/browse/ARROW-14471?focusedCommentId=17446011&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17446011 ?
   
   Something kind of like:
   
   ```
   format_map <- list(
           ymd_hyphen1 = "%Y-%m-%d",
           ymd_hyphen2 = "%y-%m-%d",
           ymd_hyphen3 = "%Y-%B-%d",
           ymd_hyphen4 = "%y-%B-%d",
           ymd_hyphen5 = "%Y-%b-%d",
           ymd_hyphen6 = "%y-%b-%d"
   )
   
   x <- gsub("[^A-Za-z0-9.]", "-", x)
   call_binding(
         "coalesce",
         call_binding("strptime", x, format = format_map[[1]], unit = "s"),
         call_binding("strptime", x, format = format_map[[2]], unit = "s"),
         call_binding("strptime", x, format = format_map[[3]], unit = "s"),
         call_binding("strptime", x, format = format_map[[4]], unit = "s"),
         call_binding("strptime", x, format = format_map[[5]], unit = "s"),
         call_binding("strptime", x, format = format_map[[6]], unit = "s")
       )
   ```
   
   You might need to use `call_binding("gsub", ...)` instead of being able to use it directly

##########
File path: r/R/dplyr-funcs-datetime.R
##########
@@ -148,4 +148,56 @@ register_bindings_datetime <- function() {
     !call_binding("am", x)
   })
 
+  register_binding("ymd", function(x) {
+    format_map <-
+      list(
+        ymd_hyphen1 = "%Y-%m-%d",
+        ymd_hyphen2 = "%y-%m-%d",
+        ymd_hyphen3 = "%Y-%B-%d",
+        ymd_hyphen4 = "%y-%B-%d",
+        ymd_hyphen5 = "%Y-%b-%d",
+        ymd_hyphen6 = "%y-%b-%d",

Review comment:
       This would at least cut down on the number of formats we need to try and _mostly_ get the right answers




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org