You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/05/15 14:46:16 UTC

[GitHub] [arrow] AlenkaF opened a new pull request #10334: Arrow 12198: [R] bindings for strptime

AlenkaF opened a new pull request #10334:
URL: https://github.com/apache/arrow/pull/10334


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] AlenkaF commented on a change in pull request #10334: ARROW-12198: [R] bindings for strptime

Posted by GitBox <gi...@apache.org>.
AlenkaF commented on a change in pull request #10334:
URL: https://github.com/apache/arrow/pull/10334#discussion_r641408098



##########
File path: r/tests/testthat/test-dplyr-string-functions.R
##########
@@ -493,3 +493,81 @@ test_that("edge cases in string detection and replacement", {
     tibble(x = c("ABC"))
   )
 })
+
+test_that("strptime", {
+
+  t_string <- tibble(x = c("2018-10-07 19:04:05", NA))
+  t_stamp <- tibble(x = c(lubridate::ymd_hms("2018-10-07 19:04:05"), NA))
+  t_stampPDT <- tibble(x = c(lubridate::ymd_hms("2018-10-07 19:04:05", tz = "PDT"), NA))

Review comment:
       I missed a typo in code, line 554: `t_stampPDT` should be used instead of `t_stamp`.
   
   But it relates to your [comment 1](https://github.com/apache/arrow/pull/10334/files#r640822991) and [comment 2](https://github.com/apache/arrow/pull/10334#pullrequestreview-670405891).
   
   I added example `t_stampPDT` to the test to see if I get a warning as `tz` agrument is given. I do but then data is pulled into `R`. Test now correctly fails as `lubridate` converts time to match PDT time zone. But then it should `stop()` as Neal [suggested](https://github.com/apache/arrow/pull/10334#discussion_r635544090) but I am not sure I know how to do that.
   
   Adding separate test to check if we error correctly could be something in the lines of:
   
   ```
   test_that("errors in strptime", {
     # Error when tz is passed
   
     x <- Expression$field_ref("x")
     expect_error(
       nse_funcs$strptime(x, tz = "PDT"),
       'Time zone argument not supported by Arrow'
     )
   })
   ```
   
   and then lines from [comment](https://github.com/apache/arrow/pull/10334/files#r640822991) are redundant.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] kszucs commented on pull request #10334: ARROW-12198: [R] bindings for strptime

Posted by GitBox <gi...@apache.org>.
kszucs commented on pull request #10334:
URL: https://github.com/apache/arrow/pull/10334#issuecomment-846949270


   > @kszucs - could you you enable @AlenkaF to build on CI please?
   
   Enabled.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] nealrichardson commented on a change in pull request #10334: ARROW-12198: [R] bindings for strptime

Posted by GitBox <gi...@apache.org>.
nealrichardson commented on a change in pull request #10334:
URL: https://github.com/apache/arrow/pull/10334#discussion_r633703297



##########
File path: r/src/compute.cpp
##########
@@ -233,6 +233,12 @@ std::shared_ptr<arrow::compute::FunctionOptions> make_compute_options(
                                      max_replacements);
   }
 
+  if (func_name == "strptime") {
+    using Options = arrow::compute::StrptimeOptions;

Review comment:
       Yeah looks like it. I made ARROW-12809 to evaluate whether that's correct, but for the purposes of the PR, it's fine.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] AlenkaF commented on a change in pull request #10334: ARROW-12198: [R] bindings for strptime

Posted by GitBox <gi...@apache.org>.
AlenkaF commented on a change in pull request #10334:
URL: https://github.com/apache/arrow/pull/10334#discussion_r633329285



##########
File path: r/tests/testthat/test-Array.R
##########
@@ -291,6 +291,17 @@ test_that("Timezone handling in Arrow roundtrip (ARROW-3543)", {
   expect_identical(read_feather(feather_file), df)
 })
 
+test_that("strptime", {
+  # array of strings
+  time_strings <- Array$create(c("2018-10-07 19:04:05", NA))
+  # array of timestamps (doesn't work if tz="" is added!)

Review comment:
       I haven't figured out yet how to pass `tz` argument to `strptime_arrow ` wrapper function as it is written now. After going through your comments, making the necessary changes then it will not be an issue I think.

##########
File path: r/R/compute.R
##########
@@ -286,3 +286,8 @@ cast_options <- function(safe = TRUE, ...) {
   )
   modifyList(opts, list(...))
 }
+
+strptime_arrow <- function(..., format, unit){

Review comment:
       Got it.
   Arrow function uses `format` and `unit` as an FunctionOption if I understand correctly, haven't found `tz` yet.
   
   I think they should have defaults, yes: `format = "%Y-%m-%d %H:%M:%S"` and `unit = TimeUnit$MICRO/2L/"us"`. 

##########
File path: r/src/compute.cpp
##########
@@ -233,6 +233,12 @@ std::shared_ptr<arrow::compute::FunctionOptions> make_compute_options(
                                      max_replacements);
   }
 
+  if (func_name == "strptime") {
+    using Options = arrow::compute::StrptimeOptions;

Review comment:
       Do I understand correctly that in `scalar_string.cc` line 1744 would suggest `StrptimeOptions` do not have `Defaults()` in C++?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jonkeane commented on pull request #10334: ARROW-12198: [R] bindings for strptime

Posted by GitBox <gi...@apache.org>.
jonkeane commented on pull request #10334:
URL: https://github.com/apache/arrow/pull/10334#issuecomment-848155516


   Oh, this was a fun one dig through and figure out what was going on. As I'm sure you've seen, the failure is only in the devel build, and _it turns out_ [that `all.equal.POSIXt()`](https://github.com/wch/r-source/blob/79298c499218846d14500255efd622b5021c10ec/src/library/base/R/all.equal.R#L492-L524) has [changed recently](https://github.com/wch/r-source/blob/trunk/doc/NEWS.Rd#L735-L739): https://bugs.r-project.org/bugzilla/show_bug.cgi?id=17277 is the bug report that adds timezone checking. 
   
   Interestingly, [comment 4](https://bugs.r-project.org/bugzilla/show_bug.cgi?id=17277#c4) seems to indicate `check.attributes` should be respected, thought it is not (currently) being respected. I've sent an email to the r-devel list and asking if `check.attributes` is supposed to be ignored in this case (but you can see a replication of this in the code below).
   
   For now, I think if you add `check.tzone = FALSE` to the `expect_equivalent()` calls that will fix this (and not cause problems in other versions).
   
   ```r
   > all.equal(
     list(lubridate::ymd_hms("2018-10-07 19:04:05", tz = NULL)),
     list(lubridate::ymd_hms("2018-10-07 19:04:05")),
     check.attributes = FALSE
   )
   [1] "Component 1: 'tzone' attributes are inconsistent ('' and 'UTC')"
   > all.equal(
     list(lubridate::ymd_hms("2018-10-07 19:04:05", tz = NULL)),
     list(lubridate::ymd_hms("2018-10-07 19:04:05")),
     check.tzone = FALSE
   )
   [1] TRUE
   > 
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] AlenkaF commented on a change in pull request #10334: ARROW-12198: [R] bindings for strptime

Posted by GitBox <gi...@apache.org>.
AlenkaF commented on a change in pull request #10334:
URL: https://github.com/apache/arrow/pull/10334#discussion_r641362129



##########
File path: r/R/dplyr-functions.R
##########
@@ -338,3 +338,24 @@ get_stringr_pattern_options <- function(pattern) {
 contains_regex <- function(string) {
   grepl("[.\\|()[{^$*+?]", string)
 }
+
+nse_funcs$strptime <- function(x, format = "%Y-%m-%d %H:%M:%S", tz = NULL, unit = 1L) {

Review comment:
       Oh yes, of course. Sorry about that.
   
   But would do "ms" as @Neal already mentioned in [ARROW-12809](https://issues.apache.org/jira/browse/ARROW-12809) to match with https://github.com/apache/arrow/blob/master/cpp/src/arrow/type.h#L1236.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] nealrichardson commented on a change in pull request #10334: ARROW-12198: [R] bindings for strptime

Posted by GitBox <gi...@apache.org>.
nealrichardson commented on a change in pull request #10334:
URL: https://github.com/apache/arrow/pull/10334#discussion_r633704593



##########
File path: r/tests/testthat/test-Array.R
##########
@@ -291,6 +291,17 @@ test_that("Timezone handling in Arrow roundtrip (ARROW-3543)", {
   expect_identical(read_feather(feather_file), df)
 })
 
+test_that("strptime", {
+  # array of strings
+  time_strings <- Array$create(c("2018-10-07 19:04:05", NA))
+  # array of timestamps (doesn't work if tz="" is added!)

Review comment:
       You're right that strptime in the C++ library doesn't take a timezone argument. Maybe it is expected that if there is a timezone, it will be encoded in the string and parseable by strptime (with the right format string)? But this gets us into the always tricky area of timezone-aware vs. timezone-naive data. @jorisvandenbossche do you have any thoughts/experience with this code?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] AlenkaF commented on a change in pull request #10334: ARROW-12198: [R] bindings for strptime

Posted by GitBox <gi...@apache.org>.
AlenkaF commented on a change in pull request #10334:
URL: https://github.com/apache/arrow/pull/10334#discussion_r633370744



##########
File path: r/R/compute.R
##########
@@ -286,3 +286,8 @@ cast_options <- function(safe = TRUE, ...) {
   )
   modifyList(opts, list(...))
 }
+
+strptime_arrow <- function(..., format, unit){

Review comment:
       Got it.
   Arrow function uses `format` and `unit` as an FunctionOption if I understand correctly, haven't found `tz` yet.
   
   I think they should have defaults, yes: `format = "%Y-%m-%d %H:%M:%S"` and `unit = TimeUnit$MICRO/2L/"us"`. 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jonkeane commented on a change in pull request #10334: ARROW-12198: [R] bindings for strptime

Posted by GitBox <gi...@apache.org>.
jonkeane commented on a change in pull request #10334:
URL: https://github.com/apache/arrow/pull/10334#discussion_r641508109



##########
File path: r/R/dplyr-functions.R
##########
@@ -338,3 +338,24 @@ get_stringr_pattern_options <- function(pattern) {
 contains_regex <- function(string) {
   grepl("[.\\|()[{^$*+?]", string)
 }
+
+nse_funcs$strptime <- function(x, format = "%Y-%m-%d %H:%M:%S", tz = NULL, unit = 1L) {
+  # Arrow uses unit for time parsing, strptime() does not.
+  # Arrow has no default option for strptime (format, unit),
+  # we suggest following format = "%Y-%m-%d %H:%M:%S", unit = MILLI/1L/"ms",
+  # (ARROW-12809)
+
+  # ParseTimestampStrptime currently ignores the timezone information (ARROW-12820).
+  # Stop if tz is provided.
+  if (is.character(tz)) {
+    arrow_not_supported("Time zone argument")
+  }
+
+  t_unit <- make_valid_time_unit(unit,c("s" = TimeUnit$SECOND, "ms" = TimeUnit$MILLI, "us" = TimeUnit$MICRO, "ns" = TimeUnit$NANO))

Review comment:
       You can add a comment if you would like, though it's not strictly necessary




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorisvandenbossche commented on a change in pull request #10334: ARROW-12198: [R] bindings for strptime

Posted by GitBox <gi...@apache.org>.
jorisvandenbossche commented on a change in pull request #10334:
URL: https://github.com/apache/arrow/pull/10334#discussion_r633812540



##########
File path: r/tests/testthat/test-Array.R
##########
@@ -291,6 +291,17 @@ test_that("Timezone handling in Arrow roundtrip (ARROW-3543)", {
   expect_identical(read_feather(feather_file), df)
 })
 
+test_that("strptime", {
+  # array of strings
+  time_strings <- Array$create(c("2018-10-07 19:04:05", NA))
+  # array of timestamps (doesn't work if tz="" is added!)

Review comment:
       @rok do you know how the system `strptime` handles such a timezone if it is present? (the docs don't specify that, and the output struct doesn't have an entry for that information)
   
   > Maybe it is expected that if there is a timezone, it will be encoded in the string and parseable by strptime (with the right format string)?
   
   The problem here is that *if* a timezone is recorded in the Timestamp type's `tz` field, then the timestamp value is expected to be in UTC, and not localized to the timezone in question (which is what you get from just parsing the string without the timezone information). So basically that means the timestamp needs to be converted from the specific timezone to UTC (if `strptime` doesn't do that for us). And for now, that's not yet something we have implemented, I think (although at some point we probably should?)




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] github-actions[bot] commented on pull request #10334: ARROW-12198: [R] bindings for strptime

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #10334:
URL: https://github.com/apache/arrow/pull/10334#issuecomment-841683807


   https://issues.apache.org/jira/browse/ARROW-12198


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] AlenkaF commented on pull request #10334: ARROW-12198: [R] bindings for strptime

Posted by GitBox <gi...@apache.org>.
AlenkaF commented on pull request #10334:
URL: https://github.com/apache/arrow/pull/10334#issuecomment-852892348


   @jonkeane @nealrichardson I need approval for the check and the code is ready for another review round. Thank you!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] rok commented on a change in pull request #10334: ARROW-12198: [R] bindings for strptime

Posted by GitBox <gi...@apache.org>.
rok commented on a change in pull request #10334:
URL: https://github.com/apache/arrow/pull/10334#discussion_r634342467



##########
File path: r/tests/testthat/test-Array.R
##########
@@ -291,6 +291,17 @@ test_that("Timezone handling in Arrow roundtrip (ARROW-3543)", {
   expect_identical(read_feather(feather_file), df)
 })
 
+test_that("strptime", {
+  # array of strings
+  time_strings <- Array$create(c("2018-10-07 19:04:05", NA))
+  # array of timestamps (doesn't work if tz="" is added!)

Review comment:
       I've created [ARROW-12820](https://issues.apache.org/jira/browse/ARROW-12820) and referenced this discussion.
   In context of this issue we could leave a reference to ARROW-12820 in the tests and postpone the timezone functionally?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] nealrichardson commented on a change in pull request #10334: ARROW-12198: [R] bindings for strptime

Posted by GitBox <gi...@apache.org>.
nealrichardson commented on a change in pull request #10334:
URL: https://github.com/apache/arrow/pull/10334#discussion_r633148838



##########
File path: r/R/compute.R
##########
@@ -286,3 +286,8 @@ cast_options <- function(safe = TRUE, ...) {
   )
   modifyList(opts, list(...))
 }
+
+strptime_arrow <- function(..., format, unit){
+  a <- collect_arrays_from_dots(list(...))

Review comment:
       I don't think you want `collect_arrays_from_dots` here. This function exists to support the base R behavior like:
   
   ```
   > sum(1, 2)
   [1] 3
   ```
   
   But `strptime` doesn't take `...` like that.

##########
File path: r/R/compute.R
##########
@@ -286,3 +286,8 @@ cast_options <- function(safe = TRUE, ...) {
   )
   modifyList(opts, list(...))
 }
+
+strptime_arrow <- function(..., format, unit){

Review comment:
       I'm not sure how useful this function is since it is a thin wrapper around `call_function()` and we can't set it as an S3 method.. More useful would be to add a version of this in the `nse_funcs` in dplyr-functions.R.
   
   In either case, we should match the `base::strptime()` signature: `function (x, format, tz = "")` with the possible addition of `unit` if that's an Arrow feature. 
   
   Also, should `format` and `unit` have default arguments?

##########
File path: r/tests/testthat/test-Array.R
##########
@@ -291,6 +291,17 @@ test_that("Timezone handling in Arrow roundtrip (ARROW-3543)", {
   expect_identical(read_feather(feather_file), df)
 })
 
+test_that("strptime", {
+  # array of strings
+  time_strings <- Array$create(c("2018-10-07 19:04:05", NA))
+  # array of timestamps (doesn't work if tz="" is added!)
+  timestamps <- Array$create(c(as.POSIXct("2018-10-07 19:04:05"), NA))
+  # array of parsed timestamps
+  parsed_timestamps <- strptime_arrow(time_strings, format = "%Y-%m-%d %H:%M:%S", unit = TimeUnit$MICRO)

Review comment:
       `unit` should also take human-friendly strings ("s", "ms", etc.); see how this is done in the `timestamp()` function in type.R.

##########
File path: r/tests/testthat/test-Array.R
##########
@@ -291,6 +291,17 @@ test_that("Timezone handling in Arrow roundtrip (ARROW-3543)", {
   expect_identical(read_feather(feather_file), df)
 })
 
+test_that("strptime", {
+  # array of strings
+  time_strings <- Array$create(c("2018-10-07 19:04:05", NA))
+  # array of timestamps (doesn't work if tz="" is added!)

Review comment:
       Why not? 

##########
File path: r/src/compute.cpp
##########
@@ -233,6 +233,12 @@ std::shared_ptr<arrow::compute::FunctionOptions> make_compute_options(
                                      max_replacements);
   }
 
+  if (func_name == "strptime") {
+    using Options = arrow::compute::StrptimeOptions;

Review comment:
       Does `StrptimeOptions` have a `Defaults()` method in C++? If so, we should call it.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jonkeane commented on a change in pull request #10334: ARROW-12198: [R] bindings for strptime

Posted by GitBox <gi...@apache.org>.
jonkeane commented on a change in pull request #10334:
URL: https://github.com/apache/arrow/pull/10334#discussion_r645808496



##########
File path: r/R/dplyr-functions.R
##########
@@ -338,3 +338,22 @@ get_stringr_pattern_options <- function(pattern) {
 contains_regex <- function(string) {
   grepl("[.\\|()[{^$*+?]", string)
 }
+
+nse_funcs$strptime <- function(x, format = "%Y-%m-%d %H:%M:%S", tz = NULL, unit = "ms") {
+  # Arrow uses unit for time parsing, strptime() does not.
+  # Arrow has no default option for strptime (format, unit),
+  # we suggest following format = "%Y-%m-%d %H:%M:%S", unit = MILLI/1L/"ms",
+  # (ARROW-12809)
+
+  # ParseTimestampStrptime currently ignores the timezone information (ARROW-12820).
+  # Stop if tz is provided.
+  if (is.character(tz)) {
+    arrow_not_supported("Time zone argument")
+  }
+
+  unit <- make_valid_time_unit(unit, c(valid_time64_units, valid_time32_units))
+
+  Expression$create("strptime", 
+                     x, 
+                     options = list(format = format, unit = unit))

Review comment:
       ```suggestion
     Expression$create("strptime", x, options = list(format = format, unit = unit))
   ```




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorisvandenbossche commented on a change in pull request #10334: ARROW-12198: [R] bindings for strptime

Posted by GitBox <gi...@apache.org>.
jorisvandenbossche commented on a change in pull request #10334:
URL: https://github.com/apache/arrow/pull/10334#discussion_r634269138



##########
File path: r/tests/testthat/test-Array.R
##########
@@ -291,6 +291,17 @@ test_that("Timezone handling in Arrow roundtrip (ARROW-3543)", {
   expect_identical(read_feather(feather_file), df)
 })
 
+test_that("strptime", {
+  # array of strings
+  time_strings <- Array$create(c("2018-10-07 19:04:05", NA))
+  # array of timestamps (doesn't work if tz="" is added!)

Review comment:
       > I'm not sure, can you cast `timestamp[ms]` to `timestamp[ms, tz="UTC"]` or whatever (without modifying the values in the array, just to set the tz)?
   
   Casting actually works, but it's simply setting the tz and not changing the actual values, so it's not necessarily the behaviour you would expect (the behaviour would be correct if you assume the tz-naive data to be in UTC, but that seems a wrong assumption).
   
   > So timestamp's timezone is currently ignored and the local time is returned. It might be good to document this or even block `%z` and `%Z` to avoid surprises?
   
   Indeed, it seems we now simply ignore any timezone information in `strptime`:
   
   ```
   >>> pc.strptime(["2012-01-01 01:02:03+01:00"], format="%Y-%m-%d %H:%M:%S%z", unit="s")
   <pyarrow.lib.TimestampArray object at 0x7fad84855220>
   [
     2012-01-01 01:02:03
   ]
   >>> pc.strptime(["2012-01-01 01:02:03+01:00"], format="%Y-%m-%d %H:%M:%S%Z", unit="s").type
   TimestampType(timestamp[s])
   ```
   
   I can see some value in keeping that working, so you can at least parse strings that include such information (otherwise you would always get an error with arrow, or you would need to do some string preprocessing to be able to pass them to `strptime`). But then we certainly need to document that. 
   On the other hand, if we want to support it in the future, that would change behaviour and erroring now might then be better ..
   
   It seems that at least some `strptime` implementation support `%z` offsets, and store that in `tm->gmt_offset`, which we currently don't use (https://code.woboq.org/userspace/glibc/time/strptime_l.c.html#776). 
   At least supporting fixed offsets (`%z`) seems doable (and the result could then be a timestamp type with `tz="UTC"`), properly supporting `%Z` timezone names will be harder.
   




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jonkeane closed pull request #10334: ARROW-12198: [R] bindings for strptime

Posted by GitBox <gi...@apache.org>.
jonkeane closed pull request #10334:
URL: https://github.com/apache/arrow/pull/10334


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] AlenkaF commented on a change in pull request #10334: ARROW-12198: [R] bindings for strptime

Posted by GitBox <gi...@apache.org>.
AlenkaF commented on a change in pull request #10334:
URL: https://github.com/apache/arrow/pull/10334#discussion_r641409188



##########
File path: r/tests/testthat/test-dplyr-string-functions.R
##########
@@ -493,3 +493,81 @@ test_that("edge cases in string detection and replacement", {
     tibble(x = c("ABC"))
   )
 })
+
+test_that("strptime", {
+
+  t_string <- tibble(x = c("2018-10-07 19:04:05", NA))
+  t_stamp <- tibble(x = c(lubridate::ymd_hms("2018-10-07 19:04:05"), NA))
+  t_stampPDT <- tibble(x = c(lubridate::ymd_hms("2018-10-07 19:04:05", tz = "PDT"), NA))
+
+  expect_equivalent(
+    t_string %>%
+      Table$create() %>%
+      mutate(
+        x = strptime(x)
+      ) %>%
+      collect(),
+    t_stamp,
+    check.tzone = FALSE
+  )
+
+  expect_equivalent(
+    t_string %>%
+      Table$create() %>%
+      mutate(
+        x = strptime(x, format = "%Y-%m-%d %H:%M:%S")
+      ) %>%
+      collect(),
+    t_stamp,
+    check.tzone = FALSE
+  )
+
+  expect_equivalent(
+    t_string %>%
+      Table$create() %>%
+      mutate(
+        x = strptime(x, format = "%Y-%m-%d %H:%M:%S", unit = TimeUnit$NANO)
+      ) %>%
+      collect(),
+    t_stamp,
+    check.tzone = FALSE
+  )
+
+  expect_equivalent(
+    t_string %>%
+      Table$create() %>%
+      mutate(
+        x = strptime(x, format = "%Y-%m-%d %H:%M:%S", unit = "s")
+      ) %>%
+      collect(),
+    t_stamp,
+    check.tzone = FALSE
+  )
+
+  expect_equivalent(
+    t_string %>%
+      Table$create() %>%
+      mutate(
+        x = strptime(x, format = "%Y-%m-%d %H:%M:%S", tz="PDT")
+      ) %>%
+      collect(),
+    t_stamp,
+    check.tzone = FALSE
+  )

Review comment:
       See reply here https://github.com/apache/arrow/pull/10334/files#r641408098




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] AlenkaF commented on a change in pull request #10334: ARROW-12198: [R] bindings for strptime

Posted by GitBox <gi...@apache.org>.
AlenkaF commented on a change in pull request #10334:
URL: https://github.com/apache/arrow/pull/10334#discussion_r633375368



##########
File path: r/src/compute.cpp
##########
@@ -233,6 +233,12 @@ std::shared_ptr<arrow::compute::FunctionOptions> make_compute_options(
                                      max_replacements);
   }
 
+  if (func_name == "strptime") {
+    using Options = arrow::compute::StrptimeOptions;

Review comment:
       Do I understand correctly that in `scalar_string.cc` line 1744 would suggest `StrptimeOptions` do not have `Defaults()` in C++?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] AlenkaF commented on pull request #10334: ARROW-12198: [R] bindings for strptime

Posted by GitBox <gi...@apache.org>.
AlenkaF commented on pull request #10334:
URL: https://github.com/apache/arrow/pull/10334#issuecomment-849516929


   Thank you @jonkeane for your feedback.
   I corrected the code and would like to ask you for a review, if I may.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] rok commented on pull request #10334: ARROW-12198: [R] bindings for strptime

Posted by GitBox <gi...@apache.org>.
rok commented on pull request #10334:
URL: https://github.com/apache/arrow/pull/10334#issuecomment-846938027


   @kszucs - could you you enable @AlenkaF to build on CI please?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] AlenkaF commented on a change in pull request #10334: ARROW-12198: [R] bindings for strptime

Posted by GitBox <gi...@apache.org>.
AlenkaF commented on a change in pull request #10334:
URL: https://github.com/apache/arrow/pull/10334#discussion_r641369517



##########
File path: r/R/dplyr-functions.R
##########
@@ -338,3 +338,24 @@ get_stringr_pattern_options <- function(pattern) {
 contains_regex <- function(string) {
   grepl("[.\\|()[{^$*+?]", string)
 }
+
+nse_funcs$strptime <- function(x, format = "%Y-%m-%d %H:%M:%S", tz = NULL, unit = 1L) {
+  # Arrow uses unit for time parsing, strptime() does not.
+  # Arrow has no default option for strptime (format, unit),
+  # we suggest following format = "%Y-%m-%d %H:%M:%S", unit = MILLI/1L/"ms",
+  # (ARROW-12809)
+
+  # ParseTimestampStrptime currently ignores the timezone information (ARROW-12820).
+  # Stop if tz is provided.
+  if (is.character(tz)) {
+    arrow_not_supported("Time zone argument")
+  }
+
+  t_unit <- make_valid_time_unit(unit,c("s" = TimeUnit$SECOND, "ms" = TimeUnit$MILLI, "us" = TimeUnit$MICRO, "ns" = TimeUnit$NANO))

Review comment:
       No need to keep them both.
   Would add as comment though, to have the mapping visible?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] nealrichardson commented on a change in pull request #10334: ARROW-12198: [R] bindings for strptime

Posted by GitBox <gi...@apache.org>.
nealrichardson commented on a change in pull request #10334:
URL: https://github.com/apache/arrow/pull/10334#discussion_r641645081



##########
File path: r/R/dplyr-functions.R
##########
@@ -338,3 +338,24 @@ get_stringr_pattern_options <- function(pattern) {
 contains_regex <- function(string) {
   grepl("[.\\|()[{^$*+?]", string)
 }
+
+nse_funcs$strptime <- function(x, format = "%Y-%m-%d %H:%M:%S", tz = NULL, unit = 1L) {
+  # Arrow uses unit for time parsing, strptime() does not.
+  # Arrow has no default option for strptime (format, unit),
+  # we suggest following format = "%Y-%m-%d %H:%M:%S", unit = MILLI/1L/"ms",
+  # (ARROW-12809)
+
+  # ParseTimestampStrptime currently ignores the timezone information (ARROW-12820).
+  # Stop if tz is provided.
+  if (is.character(tz)) {
+    arrow_not_supported("Time zone argument")
+  }
+
+  t_unit <- make_valid_time_unit(unit,c("s" = TimeUnit$SECOND, "ms" = TimeUnit$MILLI, "us" = TimeUnit$MICRO, "ns" = TimeUnit$NANO))
+
+  Expression$create("strptime", 
+                     x, 
+                     options = list(
+                      format = format, 

Review comment:
       nit: indentation seems off here (and you may just be able to do `options = list(format =...)` inline




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] AlenkaF commented on a change in pull request #10334: ARROW-12198: [R] bindings for strptime

Posted by GitBox <gi...@apache.org>.
AlenkaF commented on a change in pull request #10334:
URL: https://github.com/apache/arrow/pull/10334#discussion_r633329285



##########
File path: r/tests/testthat/test-Array.R
##########
@@ -291,6 +291,17 @@ test_that("Timezone handling in Arrow roundtrip (ARROW-3543)", {
   expect_identical(read_feather(feather_file), df)
 })
 
+test_that("strptime", {
+  # array of strings
+  time_strings <- Array$create(c("2018-10-07 19:04:05", NA))
+  # array of timestamps (doesn't work if tz="" is added!)

Review comment:
       I haven't figured out yet how to pass `tz` argument to `strptime_arrow ` wrapper function as it is written now. After going through your comments, making the necessary changes then it will not be an issue I think.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] github-actions[bot] commented on pull request #10334: Arrow 12198: [R] bindings for strptime

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #10334:
URL: https://github.com/apache/arrow/pull/10334#issuecomment-841669771


   <!--
     Licensed to the Apache Software Foundation (ASF) under one
     or more contributor license agreements.  See the NOTICE file
     distributed with this work for additional information
     regarding copyright ownership.  The ASF licenses this file
     to you under the Apache License, Version 2.0 (the
     "License"); you may not use this file except in compliance
     with the License.  You may obtain a copy of the License at
   
       http://www.apache.org/licenses/LICENSE-2.0
   
     Unless required by applicable law or agreed to in writing,
     software distributed under the License is distributed on an
     "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
     KIND, either express or implied.  See the License for the
     specific language governing permissions and limitations
     under the License.
   -->
   
   Thanks for opening a pull request!
   
   If this is not a [minor PR](https://github.com/apache/arrow/blob/master/CONTRIBUTING.md#Minor-Fixes). Could you open an issue for this pull request on JIRA? https://issues.apache.org/jira/browse/ARROW
   
   Opening JIRAs ahead of time contributes to the [Openness](http://theapacheway.com/open/#:~:text=Openness%20allows%20new%20users%20the,must%20happen%20in%20the%20open.) of the Apache Arrow project.
   
   Then could you also rename pull request title in the following format?
   
       ARROW-${JIRA_ID}: [${COMPONENT}] ${SUMMARY}
   
   or
   
       MINOR: [${COMPONENT}] ${SUMMARY}
   
   See also:
   
     * [Other pull requests](https://github.com/apache/arrow/pulls/)
     * [Contribution Guidelines - How to contribute patches](https://arrow.apache.org/docs/developers/contributing.html#how-to-contribute-patches)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] AlenkaF commented on a change in pull request #10334: ARROW-12198: [R] bindings for strptime

Posted by GitBox <gi...@apache.org>.
AlenkaF commented on a change in pull request #10334:
URL: https://github.com/apache/arrow/pull/10334#discussion_r635091519



##########
File path: r/R/compute.R
##########
@@ -286,3 +286,8 @@ cast_options <- function(safe = TRUE, ...) {
   )
   modifyList(opts, list(...))
 }
+
+strptime_arrow <- function(..., format, unit){

Review comment:
       @nealrichardson I have trouble with calling `strptime()` function from `nse_funcs` - possible name collision with base. Am I missing something? Thank you!
   
   As for defaults I correct myself, `format` shouldn't have default argument - to match `base::strptime()` signature.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] AlenkaF commented on a change in pull request #10334: ARROW-12198: [R] bindings for strptime

Posted by GitBox <gi...@apache.org>.
AlenkaF commented on a change in pull request #10334:
URL: https://github.com/apache/arrow/pull/10334#discussion_r642422637



##########
File path: r/tests/testthat/test-dplyr-string-functions.R
##########
@@ -493,3 +493,81 @@ test_that("edge cases in string detection and replacement", {
     tibble(x = c("ABC"))
   )
 })
+
+test_that("strptime", {
+
+  t_string <- tibble(x = c("2018-10-07 19:04:05", NA))
+  t_stamp <- tibble(x = c(lubridate::ymd_hms("2018-10-07 19:04:05"), NA))
+  t_stampPDT <- tibble(x = c(lubridate::ymd_hms("2018-10-07 19:04:05", tz = "PDT"), NA))
+
+  expect_equivalent(
+    t_string %>%
+      Table$create() %>%
+      mutate(
+        x = strptime(x)
+      ) %>%
+      collect(),
+    t_stamp,
+    check.tzone = FALSE
+  )
+
+  expect_equivalent(
+    t_string %>%
+      Table$create() %>%
+      mutate(
+        x = strptime(x, format = "%Y-%m-%d %H:%M:%S")
+      ) %>%
+      collect(),
+    t_stamp,
+    check.tzone = FALSE
+  )
+
+  expect_equivalent(

Review comment:
       If I use `check.tzone = FALSE` they are equal. Should I use `expect_equal()` instead of `expect_equivalent()` in the test?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] AlenkaF commented on pull request #10334: ARROW-12198: [R] bindings for strptime

Posted by GitBox <gi...@apache.org>.
AlenkaF commented on pull request #10334:
URL: https://github.com/apache/arrow/pull/10334#issuecomment-847727477


   @jonkeane @nealrichardson - in [failed test](https://github.com/apache/arrow/pull/10334/checks?check_run_id=2655705142#step:8:14756) `expect_equivalent()` fails due to time zone being stored in `lubridate` timestamp but should ignore attributes if I understand correctly. I am running out of ideas how to solve this one :(
   
   Thank you for the help!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] AlenkaF commented on pull request #10334: ARROW-12198: [R] bindings for strptime

Posted by GitBox <gi...@apache.org>.
AlenkaF commented on pull request #10334:
URL: https://github.com/apache/arrow/pull/10334#issuecomment-846985517


   > > @kszucs - could you you enable @AlenkaF to build on CI please?
   > 
   > Enabled.
   
   Thanks, I guess I need help (approval for CI build) once more =) 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] nealrichardson commented on a change in pull request #10334: ARROW-12198: [R] bindings for strptime

Posted by GitBox <gi...@apache.org>.
nealrichardson commented on a change in pull request #10334:
URL: https://github.com/apache/arrow/pull/10334#discussion_r635544090



##########
File path: r/R/compute.R
##########
@@ -286,3 +286,8 @@ cast_options <- function(safe = TRUE, ...) {
   )
   modifyList(opts, list(...))
 }
+
+strptime_arrow <- function(..., format, unit){

Review comment:
       What I was suggesting was
   
   ```r
   nse_funcs$strptime <- function(x, format = "%Y-%m-%d %H:%M:%S", tz = "", unit = "ms") {
   
   }
   ```
   
   following the model of the other functions there that build Expressions. And if `tz` is not supported somehow, we `stop()` if `tz` is provided.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] rok commented on a change in pull request #10334: ARROW-12198: [R] bindings for strptime

Posted by GitBox <gi...@apache.org>.
rok commented on a change in pull request #10334:
URL: https://github.com/apache/arrow/pull/10334#discussion_r633924437



##########
File path: r/tests/testthat/test-Array.R
##########
@@ -291,6 +291,17 @@ test_that("Timezone handling in Arrow roundtrip (ARROW-3543)", {
   expect_identical(read_feather(feather_file), df)
 })
 
+test_that("strptime", {
+  # array of strings
+  time_strings <- Array$create(c("2018-10-07 19:04:05", NA))
+  # array of timestamps (doesn't work if tz="" is added!)

Review comment:
       @jorisvandenbossche [it seems that](https://github.com/apache/arrow/blob/8e43f23dcc6a9e630516228f110c48b64d13cec6/cpp/src/arrow/util/value_parsing.h#L663) we don't really use or pass zone information even if strptime captures it. The following passes:
   ```
     options.timestamp_parsers = {TimestampParser::MakeStrptime("%Y-%m-%d %H:%M:%S %Z")};
     AssertConversion<TimestampType, int64_t>(type, {"1970-01-01 00:00:00 Etc/GMT+6,1970-01-01 00:00:00 UTC\n"}, {{0}, {0}}, options);
   ```
   So timestamp's timezone is currently ignored and the local time is returned. It might be good to document this or even block `%z` and `%Z` to avoid surprises?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] nealrichardson commented on a change in pull request #10334: ARROW-12198: [R] bindings for strptime

Posted by GitBox <gi...@apache.org>.
nealrichardson commented on a change in pull request #10334:
URL: https://github.com/apache/arrow/pull/10334#discussion_r633821222



##########
File path: r/tests/testthat/test-Array.R
##########
@@ -291,6 +291,17 @@ test_that("Timezone handling in Arrow roundtrip (ARROW-3543)", {
   expect_identical(read_feather(feather_file), df)
 })
 
+test_that("strptime", {
+  # array of strings
+  time_strings <- Array$create(c("2018-10-07 19:04:05", NA))
+  # array of timestamps (doesn't work if tz="" is added!)

Review comment:
       Right. I'm not sure, can you cast `timestamp[ms]` to `timestamp[ms, tz="UTC"]` or whatever (without modifying the values in the array, just to set the tz)?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] nealrichardson commented on a change in pull request #10334: ARROW-12198: [R] bindings for strptime

Posted by GitBox <gi...@apache.org>.
nealrichardson commented on a change in pull request #10334:
URL: https://github.com/apache/arrow/pull/10334#discussion_r633148838



##########
File path: r/R/compute.R
##########
@@ -286,3 +286,8 @@ cast_options <- function(safe = TRUE, ...) {
   )
   modifyList(opts, list(...))
 }
+
+strptime_arrow <- function(..., format, unit){
+  a <- collect_arrays_from_dots(list(...))

Review comment:
       I don't think you want `collect_arrays_from_dots` here. This function exists to support the base R behavior like:
   
   ```
   > sum(1, 2)
   [1] 3
   ```
   
   But `strptime` doesn't take `...` like that.

##########
File path: r/R/compute.R
##########
@@ -286,3 +286,8 @@ cast_options <- function(safe = TRUE, ...) {
   )
   modifyList(opts, list(...))
 }
+
+strptime_arrow <- function(..., format, unit){

Review comment:
       I'm not sure how useful this function is since it is a thin wrapper around `call_function()` and we can't set it as an S3 method.. More useful would be to add a version of this in the `nse_funcs` in dplyr-functions.R.
   
   In either case, we should match the `base::strptime()` signature: `function (x, format, tz = "")` with the possible addition of `unit` if that's an Arrow feature. 
   
   Also, should `format` and `unit` have default arguments?

##########
File path: r/tests/testthat/test-Array.R
##########
@@ -291,6 +291,17 @@ test_that("Timezone handling in Arrow roundtrip (ARROW-3543)", {
   expect_identical(read_feather(feather_file), df)
 })
 
+test_that("strptime", {
+  # array of strings
+  time_strings <- Array$create(c("2018-10-07 19:04:05", NA))
+  # array of timestamps (doesn't work if tz="" is added!)
+  timestamps <- Array$create(c(as.POSIXct("2018-10-07 19:04:05"), NA))
+  # array of parsed timestamps
+  parsed_timestamps <- strptime_arrow(time_strings, format = "%Y-%m-%d %H:%M:%S", unit = TimeUnit$MICRO)

Review comment:
       `unit` should also take human-friendly strings ("s", "ms", etc.); see how this is done in the `timestamp()` function in type.R.

##########
File path: r/tests/testthat/test-Array.R
##########
@@ -291,6 +291,17 @@ test_that("Timezone handling in Arrow roundtrip (ARROW-3543)", {
   expect_identical(read_feather(feather_file), df)
 })
 
+test_that("strptime", {
+  # array of strings
+  time_strings <- Array$create(c("2018-10-07 19:04:05", NA))
+  # array of timestamps (doesn't work if tz="" is added!)

Review comment:
       Why not? 

##########
File path: r/src/compute.cpp
##########
@@ -233,6 +233,12 @@ std::shared_ptr<arrow::compute::FunctionOptions> make_compute_options(
                                      max_replacements);
   }
 
+  if (func_name == "strptime") {
+    using Options = arrow::compute::StrptimeOptions;

Review comment:
       Does `StrptimeOptions` have a `Defaults()` method in C++? If so, we should call it.

##########
File path: r/src/compute.cpp
##########
@@ -233,6 +233,12 @@ std::shared_ptr<arrow::compute::FunctionOptions> make_compute_options(
                                      max_replacements);
   }
 
+  if (func_name == "strptime") {
+    using Options = arrow::compute::StrptimeOptions;

Review comment:
       Yeah looks like it. I made ARROW-12809 to evaluate whether that's correct, but for the purposes of the PR, it's fine.

##########
File path: r/tests/testthat/test-Array.R
##########
@@ -291,6 +291,17 @@ test_that("Timezone handling in Arrow roundtrip (ARROW-3543)", {
   expect_identical(read_feather(feather_file), df)
 })
 
+test_that("strptime", {
+  # array of strings
+  time_strings <- Array$create(c("2018-10-07 19:04:05", NA))
+  # array of timestamps (doesn't work if tz="" is added!)

Review comment:
       You're right that strptime in the C++ library doesn't take a timezone argument. Maybe it is expected that if there is a timezone, it will be encoded in the string and parseable by strptime (with the right format string)? But this gets us into the always tricky area of timezone-aware vs. timezone-naive data. @jorisvandenbossche do you have any thoughts/experience with this code?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jonkeane commented on a change in pull request #10334: ARROW-12198: [R] bindings for strptime

Posted by GitBox <gi...@apache.org>.
jonkeane commented on a change in pull request #10334:
URL: https://github.com/apache/arrow/pull/10334#discussion_r640790181



##########
File path: r/R/dplyr-functions.R
##########
@@ -338,3 +338,24 @@ get_stringr_pattern_options <- function(pattern) {
 contains_regex <- function(string) {
   grepl("[.\\|()[{^$*+?]", string)
 }
+
+nse_funcs$strptime <- function(x, format = "%Y-%m-%d %H:%M:%S", tz = NULL, unit = 1L) {

Review comment:
       Should this default be the more readable "s"? 

##########
File path: r/tests/testthat/test-dplyr-string-functions.R
##########
@@ -493,3 +493,81 @@ test_that("edge cases in string detection and replacement", {
     tibble(x = c("ABC"))
   )
 })
+
+test_that("strptime", {
+
+  t_string <- tibble(x = c("2018-10-07 19:04:05", NA))
+  t_stamp <- tibble(x = c(lubridate::ymd_hms("2018-10-07 19:04:05"), NA))
+  t_stampPDT <- tibble(x = c(lubridate::ymd_hms("2018-10-07 19:04:05", tz = "PDT"), NA))
+
+  expect_equivalent(
+    t_string %>%
+      Table$create() %>%
+      mutate(
+        x = strptime(x)
+      ) %>%
+      collect(),
+    t_stamp,
+    check.tzone = FALSE
+  )
+
+  expect_equivalent(
+    t_string %>%
+      Table$create() %>%
+      mutate(
+        x = strptime(x, format = "%Y-%m-%d %H:%M:%S")
+      ) %>%
+      collect(),
+    t_stamp,
+    check.tzone = FALSE
+  )
+
+  expect_equivalent(
+    t_string %>%
+      Table$create() %>%
+      mutate(
+        x = strptime(x, format = "%Y-%m-%d %H:%M:%S", unit = TimeUnit$NANO)

Review comment:
       Would it be possible to change this to:
   
   ```suggestion
           x = strptime(x, format = "%Y-%m-%d %H:%M:%S", unit = "ns")
   ```
   
   So it's a little closer to what we expect a person would use in practice?

##########
File path: r/tests/testthat/test-dplyr-string-functions.R
##########
@@ -493,3 +493,81 @@ test_that("edge cases in string detection and replacement", {
     tibble(x = c("ABC"))
   )
 })
+
+test_that("strptime", {
+
+  t_string <- tibble(x = c("2018-10-07 19:04:05", NA))
+  t_stamp <- tibble(x = c(lubridate::ymd_hms("2018-10-07 19:04:05"), NA))
+  t_stampPDT <- tibble(x = c(lubridate::ymd_hms("2018-10-07 19:04:05", tz = "PDT"), NA))

Review comment:
       It doesn't look like this is used later, though I could be missing something. If not, could you remove it?

##########
File path: r/tests/testthat/test-dplyr-string-functions.R
##########
@@ -493,3 +493,81 @@ test_that("edge cases in string detection and replacement", {
     tibble(x = c("ABC"))
   )
 })
+
+test_that("strptime", {
+
+  t_string <- tibble(x = c("2018-10-07 19:04:05", NA))
+  t_stamp <- tibble(x = c(lubridate::ymd_hms("2018-10-07 19:04:05"), NA))
+  t_stampPDT <- tibble(x = c(lubridate::ymd_hms("2018-10-07 19:04:05", tz = "PDT"), NA))
+
+  expect_equivalent(
+    t_string %>%
+      Table$create() %>%
+      mutate(
+        x = strptime(x)
+      ) %>%
+      collect(),
+    t_stamp,
+    check.tzone = FALSE
+  )
+
+  expect_equivalent(
+    t_string %>%
+      Table$create() %>%
+      mutate(
+        x = strptime(x, format = "%Y-%m-%d %H:%M:%S")
+      ) %>%
+      collect(),
+    t_stamp,
+    check.tzone = FALSE
+  )
+
+  expect_equivalent(

Review comment:
       Do you know if these would be equal? I'm not super familiar with how this precision is measured/handled in lubridate/R.

##########
File path: r/tests/testthat/test-dplyr-string-functions.R
##########
@@ -493,3 +493,81 @@ test_that("edge cases in string detection and replacement", {
     tibble(x = c("ABC"))
   )
 })
+
+test_that("strptime", {
+
+  t_string <- tibble(x = c("2018-10-07 19:04:05", NA))
+  t_stamp <- tibble(x = c(lubridate::ymd_hms("2018-10-07 19:04:05"), NA))
+  t_stampPDT <- tibble(x = c(lubridate::ymd_hms("2018-10-07 19:04:05", tz = "PDT"), NA))
+
+  expect_equivalent(
+    t_string %>%
+      Table$create() %>%
+      mutate(
+        x = strptime(x)
+      ) %>%
+      collect(),
+    t_stamp,
+    check.tzone = FALSE
+  )
+
+  expect_equivalent(
+    t_string %>%
+      Table$create() %>%
+      mutate(
+        x = strptime(x, format = "%Y-%m-%d %H:%M:%S")
+      ) %>%
+      collect(),
+    t_stamp,
+    check.tzone = FALSE
+  )
+
+  expect_equivalent(
+    t_string %>%
+      Table$create() %>%
+      mutate(
+        x = strptime(x, format = "%Y-%m-%d %H:%M:%S", unit = TimeUnit$NANO)
+      ) %>%
+      collect(),
+    t_stamp,
+    check.tzone = FALSE
+  )
+
+  expect_equivalent(
+    t_string %>%
+      Table$create() %>%
+      mutate(
+        x = strptime(x, format = "%Y-%m-%d %H:%M:%S", unit = "s")
+      ) %>%
+      collect(),
+    t_stamp,
+    check.tzone = FALSE
+  )
+
+  expect_equivalent(
+    t_string %>%
+      Table$create() %>%
+      mutate(
+        x = strptime(x, format = "%Y-%m-%d %H:%M:%S", tz="PDT")
+      ) %>%
+      collect(),
+    t_stamp,
+    check.tzone = FALSE
+  )

Review comment:
       I'm curious what you're testing with this test. Could you explain a little bit more about the case that it's testing?

##########
File path: r/R/dplyr-functions.R
##########
@@ -338,3 +338,24 @@ get_stringr_pattern_options <- function(pattern) {
 contains_regex <- function(string) {
   grepl("[.\\|()[{^$*+?]", string)
 }
+
+nse_funcs$strptime <- function(x, format = "%Y-%m-%d %H:%M:%S", tz = NULL, unit = 1L) {
+  # Arrow uses unit for time parsing, strptime() does not.
+  # Arrow has no default option for strptime (format, unit),
+  # we suggest following format = "%Y-%m-%d %H:%M:%S", unit = MILLI/1L/"ms",
+  # (ARROW-12809)
+
+  # ParseTimestampStrptime currently ignores the timezone information (ARROW-12820).
+  # Stop if tz is provided.
+  if (is.character(tz)) {
+    arrow_not_supported("Time zone argument")
+  }
+
+  t_unit <- make_valid_time_unit(unit,c("s" = TimeUnit$SECOND, "ms" = TimeUnit$MILLI, "us" = TimeUnit$MICRO, "ns" = TimeUnit$NANO))

Review comment:
       To match a little more closely `timestamp()`:
   
   ```suggestion
     unit <- make_valid_time_unit(unit, c(valid_time64_units, valid_time32_units))
   ```
   
   And then change `t_unit` to `unit` below (unless you have a need to keep both around?)

##########
File path: r/tests/testthat/test-dplyr-string-functions.R
##########
@@ -493,3 +493,81 @@ test_that("edge cases in string detection and replacement", {
     tibble(x = c("ABC"))
   )
 })
+
+test_that("strptime", {
+
+  t_string <- tibble(x = c("2018-10-07 19:04:05", NA))
+  t_stamp <- tibble(x = c(lubridate::ymd_hms("2018-10-07 19:04:05"), NA))
+  t_stampPDT <- tibble(x = c(lubridate::ymd_hms("2018-10-07 19:04:05", tz = "PDT"), NA))
+
+  expect_equivalent(
+    t_string %>%
+      Table$create() %>%
+      mutate(
+        x = strptime(x)
+      ) %>%
+      collect(),
+    t_stamp,
+    check.tzone = FALSE
+  )
+
+  expect_equivalent(
+    t_string %>%
+      Table$create() %>%
+      mutate(
+        x = strptime(x, format = "%Y-%m-%d %H:%M:%S")
+      ) %>%
+      collect(),
+    t_stamp,
+    check.tzone = FALSE
+  )
+
+  expect_equivalent(
+    t_string %>%
+      Table$create() %>%
+      mutate(
+        x = strptime(x, format = "%Y-%m-%d %H:%M:%S", unit = TimeUnit$NANO)
+      ) %>%
+      collect(),
+    t_stamp,
+    check.tzone = FALSE
+  )
+
+  expect_equivalent(
+    t_string %>%
+      Table$create() %>%
+      mutate(
+        x = strptime(x, format = "%Y-%m-%d %H:%M:%S", unit = "s")
+      ) %>%
+      collect(),
+    t_stamp,
+    check.tzone = FALSE
+  )
+
+  expect_equivalent(
+    t_string %>%
+      Table$create() %>%
+      mutate(
+        x = strptime(x, format = "%Y-%m-%d %H:%M:%S", tz="PDT")
+      ) %>%
+      collect(),
+    t_stamp,
+    check.tzone = FALSE
+  )
+
+  tstring <- tibble(x = c("08-05-2008", NA))
+  tstamp <- tibble(x = c(lubridate::mdy("08/05/2008"), NA))
+  tstamp[[1]] <- as.POSIXct(tstamp[[1]])

Review comment:
       I wonder if it would be clearer to do something like `strptime("08-05-2008", format = "%m-%d-%Y")` to generate the expectation here?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jonkeane commented on a change in pull request #10334: ARROW-12198: [R] bindings for strptime

Posted by GitBox <gi...@apache.org>.
jonkeane commented on a change in pull request #10334:
URL: https://github.com/apache/arrow/pull/10334#discussion_r641512824



##########
File path: r/tests/testthat/test-dplyr-string-functions.R
##########
@@ -493,3 +493,81 @@ test_that("edge cases in string detection and replacement", {
     tibble(x = c("ABC"))
   )
 })
+
+test_that("strptime", {
+
+  t_string <- tibble(x = c("2018-10-07 19:04:05", NA))
+  t_stamp <- tibble(x = c(lubridate::ymd_hms("2018-10-07 19:04:05"), NA))
+  t_stampPDT <- tibble(x = c(lubridate::ymd_hms("2018-10-07 19:04:05", tz = "PDT"), NA))

Review comment:
       I think that that's what we want actually. 
   
   In many other cases where something isn't (yet) supported in Arrow we automatically pull the data into R with a warning (in some circumstances like this). You might have found this already, but the pattern you propose for the test in your comment matches what we do elsewhere https://github.com/apache/arrow/blob/master/r/tests/testthat/test-dplyr-string-functions.R#L360-L369 which is good (comments about that test also have a bit more explanation about what's going on when the data warnings+is pulled in)




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] rok commented on a change in pull request #10334: ARROW-12198: [R] bindings for strptime

Posted by GitBox <gi...@apache.org>.
rok commented on a change in pull request #10334:
URL: https://github.com/apache/arrow/pull/10334#discussion_r633762873



##########
File path: r/tests/testthat/test-Array.R
##########
@@ -291,6 +291,17 @@ test_that("Timezone handling in Arrow roundtrip (ARROW-3543)", {
   expect_identical(read_feather(feather_file), df)
 })
 
+test_that("strptime", {
+  # array of strings
+  time_strings <- Array$create(c("2018-10-07 19:04:05", NA))
+  # array of timestamps (doesn't work if tz="" is added!)

Review comment:
       If I understand correctly [system strptime](https://man7.org/linux/man-pages/man3/strptime.3.html) is used so `%Z` or `%z` would work. E.g.: `2020-01-01 23:23:14 Europe/Amsterdam` would be captured by `format = "%Y-%m-%d %H:%M:%S %Z"`.
   Capturing timezones would be great IMO but I would listen to Joris here for sure :).




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org