You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/04/18 20:31:57 UTC

[GitHub] [arrow] jonkeane commented on a diff in pull request #12757: ARROW-14944 [R] Implement `lubridate::make_difftime()`

jonkeane commented on code in PR #12757:
URL: https://github.com/apache/arrow/pull/12757#discussion_r852389088


##########
r/R/dplyr-funcs-datetime.R:
##########
@@ -351,6 +351,25 @@ register_bindings_duration <- function() {
     delta <- delta$cast(int64())
     start + delta$cast(duration("s"))
   })
+    register_binding("make_difftime", function(num = NULL,
+                                             units = "secs",
+                                             ...) {
+    if (units != "secs") {
+      abort("`make_difftime()` with units other than 'secs' not supported in Arrow")
+    }
+
+    chunks <- list(...)
+
+    if (is.null(num)) {
+      duration <- duration_from_chunks(chunks)
+    } else if (length(chunks) == 0) {
+      duration <- num
+    } else {
+      abort("`make_difftime()` with both `num` and `...` not supported in Arrow")
+    }
+    duration <- build_expr("cast", duration, options = cast_options(to_type = int64()))

Review Comment:
   I've seen this line in a few PRs recently — this is to get around the fact that we do not have float durations, yeah? I wonder if we should put that into a helper function. Not so much to save lines of code, but to encapsulate that we have this one workaround that we're using in all of these places. 
   
   Then if that workaround is not necessary (or if we find a better one) we can change it in one place. We don't need to do that here, but it would be good to make a Jira for that if we do want to.



##########
r/tests/testthat/test-dplyr-funcs-datetime.R:
##########
@@ -1253,3 +1253,89 @@ test_that("`decimal_date()` and `date_decimal()`", {
     ignore_attr = "tzone"
   )
 })
+
+test_that("make_difftime()", {
+  test_df <- tibble(
+    seconds = c(3, 4, 5, 6),
+    minutes = c(1.5, 2.3, 4.5, 6.7),
+    hours = c(2, 3, 4, 5),
+    days = c(6, 7, 8, 9),
+    weeks = c(1, 3, 5, NA),
+    number = 10:13
+  )
+
+  compare_dplyr_binding(
+    .input %>%
+      mutate(
+        duration_from_parts = make_difftime(
+          second = seconds,
+          minute = minutes,
+          hour = hours,
+          day = days,
+          week = weeks,
+          units = "secs"
+        ),
+        duration_from_num = make_difftime(
+          num = number,
+          units =  "secs"
+        ),
+        duration_from_r_num = make_difftime(
+          num = 154,
+          units = "secs"
+        ),
+        duration_from_r_parts = make_difftime(
+          minute = 45,
+          day = 2,
+          week = 4,
+          units = "secs"
+        )
+      ) %>%
+      collect(),
+    test_df
+  )
+
+  # named difftime parts other than `second`, `minute`, `hour`, `day` and `week`
+  # are not supported
+  expect_error(
+    expect_warning(
+      test_df %>%
+        arrow_table() %>%
+        mutate(
+          err_difftime = make_difftime(month = 2)
+        ) %>%
+        collect()
+    )
+  )
+
+  # units other than "secs" not supported since they are the only ones in common
+  # between R and Arrow
+  compare_dplyr_binding(
+    .input %>%
+      mutate(error_difftime = make_difftime(num = number, units = "mins")) %>%
+      collect(),
+    test_df,
+    warning = TRUE
+  )
+
+  # constructing a difftime from both `num` and parts passed through `...` while
+  # possible with the lubridate function (resulting in a concatenation of the 2
+  # resulting objects), it errors in a dplyr context
+  expect_error(
+    expect_warning(
+      test_df %>%
+        arrow_table() %>%
+        mutate(
+          duration_from_num_and_parts = make_difftime(
+            num = number,
+            second = seconds,
+            minute = minutes,
+            hour = hours,
+            day = days,
+            week = weeks,
+            units = "secs"
+          )
+        ) %>%
+        collect()
+    )
+  )

Review Comment:
   Since this is an R-based error, we should assert what the warning is.



##########
r/tests/testthat/test-dplyr-funcs-datetime.R:
##########
@@ -1253,3 +1253,89 @@ test_that("`decimal_date()` and `date_decimal()`", {
     ignore_attr = "tzone"
   )
 })
+
+test_that("make_difftime()", {
+  test_df <- tibble(
+    seconds = c(3, 4, 5, 6),
+    minutes = c(1.5, 2.3, 4.5, 6.7),
+    hours = c(2, 3, 4, 5),
+    days = c(6, 7, 8, 9),
+    weeks = c(1, 3, 5, NA),
+    number = 10:13
+  )
+
+  compare_dplyr_binding(
+    .input %>%
+      mutate(
+        duration_from_parts = make_difftime(
+          second = seconds,
+          minute = minutes,
+          hour = hours,
+          day = days,
+          week = weeks,
+          units = "secs"
+        ),
+        duration_from_num = make_difftime(
+          num = number,
+          units =  "secs"
+        ),
+        duration_from_r_num = make_difftime(
+          num = 154,
+          units = "secs"
+        ),
+        duration_from_r_parts = make_difftime(
+          minute = 45,
+          day = 2,
+          week = 4,
+          units = "secs"
+        )
+      ) %>%
+      collect(),
+    test_df
+  )
+
+  # named difftime parts other than `second`, `minute`, `hour`, `day` and `week`
+  # are not supported
+  expect_error(
+    expect_warning(
+      test_df %>%
+        arrow_table() %>%
+        mutate(
+          err_difftime = make_difftime(month = 2)
+        ) %>%
+        collect()
+    )
+  )

Review Comment:
   Since this is an R-based error (right?), we should assert what the warning is.



##########
r/R/dplyr-funcs-datetime.R:
##########
@@ -351,6 +351,25 @@ register_bindings_duration <- function() {
     delta <- delta$cast(int64())
     start + delta$cast(duration("s"))
   })
+    register_binding("make_difftime", function(num = NULL,
+                                             units = "secs",
+                                             ...) {
+    if (units != "secs") {
+      abort("`make_difftime()` with units other than 'secs' not supported in Arrow")
+    }
+
+    chunks <- list(...)
+
+    if (is.null(num)) {
+      duration <- duration_from_chunks(chunks)
+    } else if (length(chunks) == 0) {
+      duration <- num
+    } else {
+      abort("`make_difftime()` with both `num` and `...` not supported in Arrow")
+    }

Review Comment:
   This gets the behavior we want, though the conditions on the if/else make it a little bit funny to follow: If `num` is `NULL`, then do something with `chunks`, if `chunks` don't exist do something with `NULL`.
   
   What if instead we were a bit more explicit and did:
   
   ```
   if (!is.null(num) && length(chunks > 0)) {
      abort("`make_difftime()` with both `num` and `...` not supported in Arrow")
   }
   
   if (!is.null(num)) {
     duration <- num
   } else {
     duration <- duration_from_chunks(chunks)
   }
   ```
   
   or flip the last if/else around to do the chunks first.



##########
r/R/dplyr-funcs-datetime.R:
##########
@@ -372,3 +391,36 @@ binding_format_datetime <- function(x, format = "", tz = "", usetz = FALSE) {
 
   build_expr("strftime", x, options = list(format = format, locale = Sys.getlocale("LC_TIME")))
 }
+
+# this is a helper function used for creating a difftime / duration objects from
+# several of the accepted pieces (second, minute, hour, day, week)
+duration_from_chunks <- function(chunks) {
+  accepted_chunks <- c("second", "minute", "hour", "day", "week")
+  matched_chunks <- accepted_chunks[pmatch(names(chunks), accepted_chunks, duplicates.ok = TRUE)]
+
+  if (any(is.na(matched_chunks))) {
+    abort(
+      paste0(
+        "named `difftime` units other than: ",
+        oxford_paste(accepted_chunks, quote_symbol = "`"),
+        " not supported in Arrow. \nInvalid `difftime` parts: ",
+        oxford_paste(names(chunks[is.na(matched_chunks)]), quote_symbol = "`")
+      )
+    )
+  }
+
+  matched_chunks <- matched_chunks[!is.na(matched_chunks)]
+
+  chunks <- chunks[matched_chunks]
+  chunk_duration <- c(
+    "second" = 1L,
+    "minute" = 60L,
+    "hour" = 3600L,
+    "day" = 86400L,
+    "week" = 604800L
+  )
+  # transform the duration of each chunk in seconds and add everything together
+  chunks_total <- purrr::imap(chunks, ~.x * chunk_duration[[.y]]) %>%
+    purrr::reduce(`+`)
+  chunks_total

Review Comment:
   ```suggestion
     chunks_total <- purrr::imap(chunks, ~.x * chunk_duration[[.y]])
     purrr::reduce(chunks_total, `+`)
   ```
   
   FYI: we don't actually have `%>%` defined in Arrow
   
   https://github.com/apache/arrow/runs/6019803686?check_suite_focus=true#step:7:4265
   ```
   Undefined global functions or variables:
     %>%
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org