You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/01/16 22:29:40 UTC

[GitHub] [arrow] thisisnic commented on a change in pull request #12154: ARROW-14821: [R] Implement bindings for lubridate's floor_date, ceiling_date, and round_date

thisisnic commented on a change in pull request #12154:
URL: https://github.com/apache/arrow/pull/12154#discussion_r785521592



##########
File path: r/R/util.R
##########
@@ -209,3 +209,74 @@ handle_csv_read_error <- function(e, schema) {
 
   abort(e)
 }
+
+
+parse_period_unit <- function(x) {
+
+  # the regexp matches against fractional units, but per lubridate
+  # supports integer multiples of a known unit only
+  match_info <- regexpr(
+    pattern = " *(?<multiple>[0-9.,]+)? *(?<unit>[^ \t\n]+)",
+    text = x[[1]],
+    perl = TRUE
+  )
+
+  capture_start <- attr(match_info, "capture.start")
+  capture_length <- attr(match_info, "capture.length")
+  capture_end <- capture_start + capture_length - 1L
+
+  str_unit <- substr(x, capture_start[[2]], capture_end[[2]])
+  str_multiple <- substr(x, capture_start[[1]], capture_end[[1]])
+
+  known_units <- c("nanosecond", "microsecond", "millisecond", "second",
+                   "minute", "hour", "day", "week", "month", "quarter", "year")
+
+  # match the period unit
+  str_unit_start <- substr(str_unit, 1, 3)
+  unit <- as.integer(pmatch(str_unit_start, known_units)) - 1L
+
+  if(any(is.na(unit))) {
+    abort(sprintf("Unknown unit '%s'", str_unit))
+  }
+
+  # empty string in multiple interpreted as 1
+  if(capture_length[[1]] == 0) {
+    multiple <- 1L
+
+  } else {
+
+    # special cases: interpret fractions of 1 second as integer
+    # multiples of nanoseconds, microseconds, or milliseconds
+    # to mirror lubridate syntax
+    multiple <- as.numeric(str_multiple)
+
+    if(unit == 3L & multiple < 10^-6) {
+      unit <- 0L
+      multiple <- 10^9 * multiple
+    }
+    if(unit == 3L & multiple < 10^-3) {
+      unit <- 1L
+      multiple <- 10^6 * multiple
+    }
+    if(unit == 3L & multiple < 1) {
+      unit <- 2L
+      multiple <- 10^3 * multiple
+    }
+
+    multiple <- as.integer(multiple)
+  }
+
+  # more special cases: lubridate imposes sensible maximum
+  # values on the number of seconds, minutes and hours
+  if(unit == 3L & multiple > 60) {
+    abort("Rounding with second > 60 is not supported")
+  }

Review comment:
       I can see where you are coming from here.  I guess, generally, the aim when implementing the functions for non-standard evaluation code blocks is to (as closely as possible unless there's good reason not to!) replicate the original.  However, there are some rare cases when I've seen us not do that; for example, our mapping to `lubridate::second()` supports nanoseconds as we didn't see the harm in having more precise results, and so the unit test for that binding looks like this: https://github.com/apache/arrow/blob/0c1fd88953585485b772dfd405bbc5b1b5417324/r/tests/testthat/test-dplyr-funcs-datetime.R#L474-L483 
   
   That said, I also can't think of a reason someone else's code would depend on raising an error *but* I'm not confident that that would mean it won't - people have all sorts of unusual workflows.  
   
   There are multiple ways of calling Arrow C++ compute functions from R via the `dplyr`/non-standard evaluation code - either using the bindings to the `lubridate` etc equivalent, or calling the Arrow C++ compute function directly by prefixing the function name with `arrow_` (so in this case it'd be `arrow_round_temporal`; if you've not come across this before, check it out here: https://arrow.apache.org/cookbook/r/manipulating-data---tables.html#use-arrow-functions-in-dplyr-verbs-in-arrow).
   
   So if people use that directly they'd still have access to the full functionality. I'm not sure what to suggest here - would be good to get input from @jonkeane or @nealrichardson 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org