You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/02/15 17:07:39 UTC

[GitHub] [arrow] dragosmg opened a new pull request #12433: ARROW-14808 [R] Implement bindings for `lubridate::date()`

dragosmg opened a new pull request #12433:
URL: https://github.com/apache/arrow/pull/12433


   The following will be supported in arrow (`base::as.Date()` included to show the difference to `date()`, but not supported):
   ``` r
   suppressPackageStartupMessages(library(dplyr))
   suppressPackageStartupMessages(library(lubridate))
   
   df <- tibble(a = as.POSIXct("2012-03-26 23:12:13", tz = "America/New_York"))
   
   df %>%
     mutate(
       a_date = date(a),
       a_date_base = as.Date(a))
   #> # A tibble: 1 × 3
   #>   a                   a_date     a_date_base
   #>   <dttm>              <date>     <date>     
   #> 1 2012-03-26 23:12:13 2012-03-26 2012-03-27
   ```
   
   ``` r
   suppressPackageStartupMessages(library(arrow))
   suppressPackageStartupMessages(library(dplyr))
   suppressPackageStartupMessages(library(lubridate))
   
   df <- tibble(a = as.POSIXct("2012-03-26 23:12:13", tz = "America/New_York"))
   df %>% 
     arrow_table() %>% 
     mutate(
       a_date = date(a)
     ) %>% 
     collect()
   #> # A tibble: 1 × 2
   #>   a                   a_date    
   #>   <dttm>              <date>    
   #> 1 2012-03-26 23:12:13 2012-03-26
   ```
   
   <sup>Created on 2022-02-15 by the [reprex package](https://reprex.tidyverse.org) (v2.0.1)</sup>


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] dragosmg commented on a change in pull request #12433: ARROW-14808 [R] Implement bindings for `lubridate::date()`

Posted by GitBox <gi...@apache.org>.
dragosmg commented on a change in pull request #12433:
URL: https://github.com/apache/arrow/pull/12433#discussion_r817579897



##########
File path: r/tests/testthat/test-dplyr-funcs-type.R
##########
@@ -768,3 +769,138 @@ test_that("nested structs can be created from scalars and existing data frames",
     tibble(a = 1:2)
   )
 })
+
+test_that("as.Date() converts successfully from date, timestamp, integer, char and double", {
+  test_df <- tibble::tibble(
+    posixct_var = as.POSIXct("2022-02-25 00:00:01", tz = "Europe/London"),
+    date_var = as.Date("2022-02-25"),
+    character_ymd_var = "2022-02-25 00:00:01",
+    character_ydm_var = "2022/25/02 00:00:01",
+    integer_var = 32L,
+    double_var = 34.56
+  )
+
+  # casting from POSIXct treated separately so we can skip on Windows
+  # TODO move the test for casting from POSIXct below once ARROW-13168 is done
+  compare_dplyr_binding(
+    .input %>%
+      mutate(
+        date_dv = as.Date(date_var),
+        date_char_ymd = as.Date(character_ymd_var, format = "%Y-%m-%d %H:%M:%S"),
+        date_char_ydm = as.Date(character_ydm_var, format = "%Y/%d/%m %H:%M:%S"),
+        date_int = as.Date(integer_var, origin = "1970-01-01")
+      ) %>%
+      collect(),
+    test_df
+  )
+
+  # the way we go about it is a bit different, but the result is the same =>
+  # testing without compare_dplyr_binding()
+  expect_equal(
+    test_df %>%
+      arrow_table() %>%
+      mutate(date_double = as.Date(double_var)) %>%
+      collect(),
+    test_df %>%
+      arrow_table() %>%
+      mutate(date_double = as.Date(double_var, origin = "1970-01-01")) %>%
+      collect()
+  )
+
+  expect_equal(
+    test_df %>%
+      record_batch() %>%
+      mutate(date_double = as.Date(double_var)) %>%
+      collect(),
+    test_df %>%
+      arrow_table() %>%
+      mutate(date_double = as.Date(double_var, origin = "1970-01-01")) %>%
+      collect()
+  )
+
+  # actual and expected differ due to doubles are accounted for (floored in
+  # arrow and rounded to the next decimal in R)
+  expect_error(
+    compare_dplyr_binding(
+      .input %>%
+        mutate(date_double = as.Date(double_var, origin = "1970-01-01")) %>%
+        collect(),
+      test_df
+    )
+  )

Review comment:
       Initially I implemented it with `round()`, but this rounds up if the fractional part is greater than 0.5, which I don't think is what we want. 
   ``` r
   as.Date(34, origin = "1970-01-01") 
   #> [1] "1970-02-04"
   as.Date(34.56, origin = "1970-01-01") 
   #> [1] "1970-02-04"
   as.Date(round(34.56), origin = "1970-01-01")
   #> [1] "1970-02-05"
   as.Date(floor(34.56), origin = "1970-01-01")
   #> [1] "1970-02-04"
   ```
   
   <sup>Created on 2022-03-02 by the [reprex package](https://reprex.tidyverse.org) (v2.0.1)</sup>




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jonkeane commented on a change in pull request #12433: ARROW-14808 [R] Implement bindings for `lubridate::date()`

Posted by GitBox <gi...@apache.org>.
jonkeane commented on a change in pull request #12433:
URL: https://github.com/apache/arrow/pull/12433#discussion_r817932243



##########
File path: r/tests/testthat/test-dplyr-funcs-type.R
##########
@@ -768,3 +769,138 @@ test_that("nested structs can be created from scalars and existing data frames",
     tibble(a = 1:2)
   )
 })
+
+test_that("as.Date() converts successfully from date, timestamp, integer, char and double", {
+  test_df <- tibble::tibble(
+    posixct_var = as.POSIXct("2022-02-25 00:00:01", tz = "Europe/London"),
+    date_var = as.Date("2022-02-25"),
+    character_ymd_var = "2022-02-25 00:00:01",
+    character_ydm_var = "2022/25/02 00:00:01",
+    integer_var = 32L,
+    double_var = 34.56
+  )
+
+  # casting from POSIXct treated separately so we can skip on Windows
+  # TODO move the test for casting from POSIXct below once ARROW-13168 is done
+  compare_dplyr_binding(
+    .input %>%
+      mutate(
+        date_dv = as.Date(date_var),
+        date_char_ymd = as.Date(character_ymd_var, format = "%Y-%m-%d %H:%M:%S"),
+        date_char_ydm = as.Date(character_ydm_var, format = "%Y/%d/%m %H:%M:%S"),
+        date_int = as.Date(integer_var, origin = "1970-01-01")
+      ) %>%
+      collect(),
+    test_df
+  )
+
+  # the way we go about it is a bit different, but the result is the same =>
+  # testing without compare_dplyr_binding()
+  expect_equal(
+    test_df %>%
+      arrow_table() %>%
+      mutate(date_double = as.Date(double_var)) %>%
+      collect(),
+    test_df %>%
+      arrow_table() %>%
+      mutate(date_double = as.Date(double_var, origin = "1970-01-01")) %>%
+      collect()
+  )
+
+  expect_equal(
+    test_df %>%
+      record_batch() %>%
+      mutate(date_double = as.Date(double_var)) %>%
+      collect(),
+    test_df %>%
+      arrow_table() %>%
+      mutate(date_double = as.Date(double_var, origin = "1970-01-01")) %>%
+      collect()
+  )
+
+  # actual and expected differ due to doubles are accounted for (floored in
+  # arrow and rounded to the next decimal in R)
+  expect_error(
+    compare_dplyr_binding(
+      .input %>%
+        mutate(date_double = as.Date(double_var, origin = "1970-01-01")) %>%
+        collect(),
+      test_df
+    )
+  )

Review comment:
       Thanks for the explanation. Is this part of the comment still accurate then: `(floored in arrow and rounded to the next decimal in R)`? 
   
   I suspect (but don't know for certain!) what's going on is that you're running into how R stores dates and how that differs from [`date32()`](https://arrow.apache.org/docs/cpp/api/datatype.html?highlight=date32#classarrow_1_1_date32_type) in Arrow. In R, a date object can be a float (I haven't looked at the source to see if it's _always_ stored as a float, but that would be interesting to know!) and that number is number of days since the epoch [1]. So in R you can have fractional days:
   
   ```
   > as.Date(36.54, origin = "1970-01-01")
   [1] "1970-02-06"
   > as.numeric(as.Date(36.54, origin = "1970-01-01"))
   [1] 36.54
   ```
   
   So if you add a small amount (but enough to get to the next whole number you'll see a new date:
   
   ```
   > as.Date(36.54, origin = "1970-01-01") + 0.46
   [1] "1970-02-07"
   ```
   
   But if we actually floored here, we would get the integer, and adding the same amount won't get you to the next day (just to a bit before noon here):
   
   ```
   > as.numeric(as.Date(floor(36.54), origin = "1970-01-01"))
   [1] 36
   > as.Date(floor(36.54), origin = "1970-01-01") + 0.46
   [1] "1970-02-06"
   ```
   
   Soooo, this means for us that we need to choose from (in order of best to worst IMO, but all would be fine I think):
   
   * Store a more precise value (e.g. as [`date64()`](https://arrow.apache.org/docs/cpp/api/datatype.html?highlight=date32#classarrow_1_1_date64_type) though we can't simply `cast(x, date64())` because `date64()` stores milliseconds since the epoch. We also might still have some complications comparison — I haven't experimented with `date64()` objects getting pulled back into R and if they come in as Dates backed by floats.
   * Not accept non-integers at all with an error and make a Jira to clean this up later.
   * Accept that Arrow simple floors, and the actual numeric values are different 
   
   
   
   [1] — and it actually is the epoch, it converts from a different origin:
   
   ```
   as.numeric(as.Date(36.54, origin = "1999-12-31"))
   [1] 10992.54
   ```




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] dragosmg commented on a change in pull request #12433: ARROW-14808 [R] Implement bindings for `lubridate::date()`

Posted by GitBox <gi...@apache.org>.
dragosmg commented on a change in pull request #12433:
URL: https://github.com/apache/arrow/pull/12433#discussion_r817601442



##########
File path: r/tests/testthat/test-dplyr-funcs-type.R
##########
@@ -768,3 +769,138 @@ test_that("nested structs can be created from scalars and existing data frames",
     tibble(a = 1:2)
   )
 })
+
+test_that("as.Date() converts successfully from date, timestamp, integer, char and double", {
+  test_df <- tibble::tibble(
+    posixct_var = as.POSIXct("2022-02-25 00:00:01", tz = "Europe/London"),
+    date_var = as.Date("2022-02-25"),
+    character_ymd_var = "2022-02-25 00:00:01",
+    character_ydm_var = "2022/25/02 00:00:01",
+    integer_var = 32L,
+    double_var = 34.56
+  )
+
+  # casting from POSIXct treated separately so we can skip on Windows
+  # TODO move the test for casting from POSIXct below once ARROW-13168 is done
+  compare_dplyr_binding(
+    .input %>%
+      mutate(
+        date_dv = as.Date(date_var),
+        date_char_ymd = as.Date(character_ymd_var, format = "%Y-%m-%d %H:%M:%S"),
+        date_char_ydm = as.Date(character_ydm_var, format = "%Y/%d/%m %H:%M:%S"),
+        date_int = as.Date(integer_var, origin = "1970-01-01")
+      ) %>%
+      collect(),
+    test_df
+  )
+
+  # the way we go about it is a bit different, but the result is the same =>
+  # testing without compare_dplyr_binding()
+  expect_equal(
+    test_df %>%
+      arrow_table() %>%
+      mutate(date_double = as.Date(double_var)) %>%
+      collect(),
+    test_df %>%
+      arrow_table() %>%
+      mutate(date_double = as.Date(double_var, origin = "1970-01-01")) %>%
+      collect()
+  )

Review comment:
       _I think_ they're remnants from a previous iteration. Removed. 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jonkeane commented on a change in pull request #12433: ARROW-14808 [R] Implement bindings for `lubridate::date()`

Posted by GitBox <gi...@apache.org>.
jonkeane commented on a change in pull request #12433:
URL: https://github.com/apache/arrow/pull/12433#discussion_r810508832



##########
File path: r/tests/testthat/test-dplyr-funcs-datetime.R
##########
@@ -711,3 +711,23 @@ test_that("am/pm mirror lubridate", {
   )
 
 })
+
+test_that("date works in arrow", {
+  # https://issues.apache.org/jira/browse/ARROW-13168
+  skip_on_os("windows")
+  # this date is specific since lubridate::date() is different from base::as.Date()
+  # since as.Date returns the UTC date and date() doesn't
+  test_df <- tibble(
+    a = as.POSIXct(c("2012-03-26 23:12:13", NA), tz = "America/New_York"))
+
+  # we can't (for now) use namespacing, so we need to make sure lubridate::date()
+  # and not base::date() is being used
+  # TODO remove once https://issues.apache.org/jira/browse/ARROW-14575 is done
+  date <- lubridate::date

Review comment:
       This is only necessary in our testing — right? Something about the way that `compare_dplyr_binding()` or testthat works makes it so we're not getting `lubridate::date()` by default here even though presumably we should since we  call `library(lubridate, warn.conflicts = FALSE)` at the top of this file, yeah?

##########
File path: r/R/dplyr-funcs-datetime.R
##########
@@ -147,5 +147,7 @@ register_bindings_datetime <- function() {
   register_binding("pm", function(x) {
     !call_binding("am", x)
   })
-
+  register_binding("date", function(x) {
+    x$cast(date32())
+  })

Review comment:
       Similar to https://github.com/apache/arrow/pull/12431#discussion_r810508212 are the errors that we get if `x` here is something that would be workable for our users?
   
   Separately, what happens if you pass an R object to this instead of a column in an arrow table? Does that work ok?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] dragosmg commented on a change in pull request #12433: ARROW-14808 [R] Implement bindings for `lubridate::date()`

Posted by GitBox <gi...@apache.org>.
dragosmg commented on a change in pull request #12433:
URL: https://github.com/apache/arrow/pull/12433#discussion_r811036678



##########
File path: r/R/dplyr-funcs-datetime.R
##########
@@ -147,5 +147,7 @@ register_bindings_datetime <- function() {
   register_binding("pm", function(x) {
     !call_binding("am", x)
   })
-
+  register_binding("date", function(x) {
+    x$cast(date32())
+  })

Review comment:
       Q1: I have added some unit tests for _unsupported_ types. And discovered that, unlike {lubridate}, you can cast a `date` from an `integer` in {arrow}. It looks like the integer is interpreted as number of days since epoch. 
   ``` r
   suppressPackageStartupMessages(library(arrow))
   call_binding("date", Scalar$create(32L))
   #> Scalar
   #> 1970-02-02
   
   lubridate::date(32L)
   #> Error in as.POSIXlt.numeric(x, tz = tz(x)): 'origin' must be supplied
   ```
   
   <sup>Created on 2022-02-21 by the [reprex package](https://reprex.tidyverse.org) (v2.0.1)</sup> 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] dragosmg commented on a change in pull request #12433: ARROW-14808 [R] Implement bindings for `lubridate::date()`

Posted by GitBox <gi...@apache.org>.
dragosmg commented on a change in pull request #12433:
URL: https://github.com/apache/arrow/pull/12433#discussion_r817549370



##########
File path: r/R/dplyr-funcs-type.R
##########
@@ -76,6 +76,60 @@ register_bindings_type_cast <- function() {
   register_binding("as.numeric", function(x) {
     Expression$create("cast", x, options = cast_options(to_type = float64()))
   })
+  register_binding("as.Date", function(x,
+                                       format = NULL,
+                                       tryFormats = "%Y-%m-%d",
+                                       origin = "1970-01-01",
+                                       tz = "UTC") {
+
+    if (call_binding("is.Date", x)) {
+      # base::as.Date() first converts to the desired timezone and then extracts
+      # the date, which is why we need to go through timestamp() first
+      return(x)
+
+    # cast from POSIXct
+    } else if (call_binding("is.POSIXct", x)) {
+      if (tz == "UTC") {
+        x <- build_expr("cast", x, options = cast_options(to_type = timestamp(timezone = tz)))
+      } else {
+        abort("`as.Date()` with a timezone different to 'UTC' is not supported in Arrow")
+      }
+
+    # cast from character
+    } else if (call_binding("is.character", x)) {
+      # this could be improved with tryFormats once strptime returns NA and we
+      # can use coalesce - https://issues.apache.org/jira/browse/ARROW-15659
+      # TODO revisit once https://issues.apache.org/jira/browse/ARROW-15659 is done
+      if (is.null(format)) {
+        if (length(tryFormats) == 1) {
+          format <- tryFormats[1]
+        } else {
+          abort("`as.Date()` with multiple `tryFormats` is not supported in Arrow yet")
+        }
+      }
+      x <- build_expr("strptime", x, options = list(format = format, unit = 0L))
+
+    # cast from numeric
+    } else if (call_binding("is.numeric", x)) {
+      # the origin argument will be better supported once we implement temporal
+      # arithmetic (https://issues.apache.org/jira/browse/ARROW-14947)
+      # TODO revisit once the above has been sorted
+      if (!call_binding("is.integer", x)) {
+        # Arrow does not support direct casting from double to date so we have
+        # to convert to integers first - casting to int32() would error so we
+        # need to use `floor` before casting. `floor` is also a bit safer than
+        # int32() with `safe = FALSE` since it doesn't switch to `ceiling` for
+        # negative numbers
+        # TODO revisit if arrow decides to support double -> date casting
+        x <- call_binding("floor", x)
+        x <- build_expr("cast", x, options = cast_options(to_type = int32()))
+      }
+      if (origin != "1970-01-01") {
+        abort("`as.Date()` with an `origin` different than '1970-01-01' is not supported in Arrow")
+      }
+    }

Review comment:
       While I generally agree that argument validation should happen at the top of a function. I'm on the fence on this one. The origin argument is only relevant when trying to coerce a numeric, so I'm a bit reluctant to move it outside the `numeric` block. 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jonkeane commented on a change in pull request #12433: ARROW-14808 [R] Implement bindings for `lubridate::date()`

Posted by GitBox <gi...@apache.org>.
jonkeane commented on a change in pull request #12433:
URL: https://github.com/apache/arrow/pull/12433#discussion_r812033214



##########
File path: r/R/dplyr-funcs-datetime.R
##########
@@ -147,5 +147,7 @@ register_bindings_datetime <- function() {
   register_binding("pm", function(x) {
     !call_binding("am", x)
   })
-
+  register_binding("date", function(x) {
+    x$cast(date32())
+  })

Review comment:
       Take a look at that warning/error message though: recycling of length one _does_ work. 
   
   If you want some more context on that I would recommend looking at the PR / jiras related to the error message there "only values of size one are recycled" (or possibly even comments around that code. That's what I'd do in this scenario, at least!) and that should have some more context about what/why/how that works.
   




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] dragosmg commented on a change in pull request #12433: ARROW-14808 [R] Implement bindings for `lubridate::date()`

Posted by GitBox <gi...@apache.org>.
dragosmg commented on a change in pull request #12433:
URL: https://github.com/apache/arrow/pull/12433#discussion_r818790972



##########
File path: r/R/dplyr-funcs-type.R
##########
@@ -76,6 +76,48 @@ register_bindings_type_cast <- function() {
   register_binding("as.numeric", function(x) {
     Expression$create("cast", x, options = cast_options(to_type = float64()))
   })
+  register_binding("as.Date", function(x,
+                                       format = NULL,
+                                       tryFormats = "%Y-%m-%d",
+                                       origin = "1970-01-01",
+                                       tz = "UTC") {
+
+    # the origin argument will be better supported once we implement temporal
+    # arithmetic (https://issues.apache.org/jira/browse/ARROW-14947)
+    # TODO revisit once the above has been sorted
+    if (call_binding("is.numeric", x) & origin != "1970-01-01") {
+      abort("`as.Date()` with an `origin` different than '1970-01-01' is not supported in Arrow")
+    }
+
+    # this could be improved with tryFormats once strptime returns NA and we
+    # can use coalesce - https://issues.apache.org/jira/browse/ARROW-15659
+    # TODO revisit once https://issues.apache.org/jira/browse/ARROW-15659 is done
+    if (is.null(format) && length(tryFormats) > 1) {
+      abort("`as.Date()` with multiple `tryFormats` is not supported in Arrow")
+    }
+
+    if (call_binding("is.Date", x)) {
+      return(x)
+
+    # cast from POSIXct
+    } else if (call_binding("is.POSIXct", x)) {
+      # base::as.Date() first converts to the desired timezone and then extracts
+      # the date, which is why we need to go through timestamp() first
+      x <- build_expr("cast", x, options = cast_options(to_type = timestamp(timezone = tz)))
+
+    # cast from character
+    } else if (call_binding("is.character", x)) {
+      format <- format %||% tryFormats[[1]]
+      x <- build_expr("strptime", x, options = list(format = format, unit = 0L))

Review comment:
       Done




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] ursabot edited a comment on pull request #12433: ARROW-14808 [R] Implement bindings for `lubridate::date()`

Posted by GitBox <gi...@apache.org>.
ursabot edited a comment on pull request #12433:
URL: https://github.com/apache/arrow/pull/12433#issuecomment-1058320886


   Benchmark runs are scheduled for baseline = ce46c1adf10654248b02c349210bb63ecc6828ad and contender = 9719eae66dcf38c966ae769215d27020a6dd5550. 9719eae66dcf38c966ae769215d27020a6dd5550 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
   Conbench compare runs links:
   [Finished :arrow_down:0.0% :arrow_up:0.0%] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/f23d468014004a348cc275618eb8dbd7...0caa28678d194dc0b298bad2a2b53ca3/)
   [Finished :arrow_down:0.13% :arrow_up:0.0%] [test-mac-arm](https://conbench.ursa.dev/compare/runs/b4207e561b6447d984f6e26c65ee020d...f85c32d499a944df8a229a51f0aa8efb/)
   [Failed :arrow_down:3.21% :arrow_up:0.0%] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/2dd547de0ff44aeab70df443df281742...5a56a8e391904fa38b23a33f43ab61d6/)
   [Finished :arrow_down:0.26% :arrow_up:0.09%] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/1595f8e1d2c84a83928b04b87805d46c...8ea69c725c34424e867ffbbafe2533e4/)
   Supported benchmarks:
   ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python. Runs only benchmarks with cloud = True
   test-mac-arm: Supported benchmark langs: C++, Python, R
   ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
   ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jonkeane commented on a change in pull request #12433: ARROW-14808 [R] Implement bindings for `lubridate::date()`

Posted by GitBox <gi...@apache.org>.
jonkeane commented on a change in pull request #12433:
URL: https://github.com/apache/arrow/pull/12433#discussion_r817900666



##########
File path: r/R/dplyr-funcs-type.R
##########
@@ -76,6 +76,60 @@ register_bindings_type_cast <- function() {
   register_binding("as.numeric", function(x) {
     Expression$create("cast", x, options = cast_options(to_type = float64()))
   })
+  register_binding("as.Date", function(x,
+                                       format = NULL,
+                                       tryFormats = "%Y-%m-%d",
+                                       origin = "1970-01-01",
+                                       tz = "UTC") {
+
+    if (call_binding("is.Date", x)) {
+      # base::as.Date() first converts to the desired timezone and then extracts
+      # the date, which is why we need to go through timestamp() first
+      return(x)
+
+    # cast from POSIXct
+    } else if (call_binding("is.POSIXct", x)) {
+      if (tz == "UTC") {
+        x <- build_expr("cast", x, options = cast_options(to_type = timestamp(timezone = tz)))
+      } else {
+        abort("`as.Date()` with a timezone different to 'UTC' is not supported in Arrow")
+      }

Review comment:
       My point wasn't about where the arg check was happening here, my point was: I don't think you need to do the arg check at all. I believe we have all the machinery in Arrow to support timezones other than UTC both in the input and as an argument here. 
   
   This is a more explicit version of above, but the same general principle: 
   
   ``` r
   library(arrow, warn.conflicts = FALSE)
   library(dplyr, warn.conflicts = FALSE)
   
   df <- data.frame(time = as.POSIXct("2020-01-01 23:30:00", tz = "America/Chicago"))
   
   df %>% 
     mutate(
       as_date = as.Date(time),
       as_date_nyc = as.Date(time, tz = "America/New_York"),
       as_date_chi = as.Date(time, tz = "America/Chicago"),
       as_date_lax = as.Date(time, tz = "America/Los_Angeles")
     ) %>%
     collect()
   #>                  time    as_date as_date_nyc as_date_chi as_date_lax
   #> 1 2020-01-01 23:30:00 2020-01-02  2020-01-02  2020-01-01  2020-01-01
   
   df %>% 
     arrow_table() %>%
     mutate(
       as_date = cast(cast(time, timestamp(timezone = "UTC")), date32()),
       as_date_nyc = cast(cast(time, timestamp(timezone = "America/New_York")), date32()),
       as_date_chi = cast(cast(time, timestamp(timezone = "America/Chicago")), date32()),
       as_date_lax = cast(cast(time, timestamp(timezone = "America/Los_Angeles")), date32())
     ) %>%
     collect()
   #>                  time    as_date as_date_nyc as_date_chi as_date_lax
   #> 1 2020-01-01 23:30:00 2020-01-02  2020-01-02  2020-01-01  2020-01-01
   ```
    
    
    Am I missing something here? Or misunderstanding the issue?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] dragosmg commented on a change in pull request #12433: ARROW-14808 [R] Implement bindings for `lubridate::date()`

Posted by GitBox <gi...@apache.org>.
dragosmg commented on a change in pull request #12433:
URL: https://github.com/apache/arrow/pull/12433#discussion_r816601524



##########
File path: r/tests/testthat/test-dplyr-funcs-datetime.R
##########
@@ -801,3 +801,94 @@ test_that("dst extracts daylight savings time correctly", {
     test_df
   )
 })
+
+test_that("date works in arrow", {
+  # https://issues.apache.org/jira/browse/ARROW-13168
+  skip_on_os("windows")
+  # this date is specific since lubridate::date() is different from base::as.Date()
+  # since as.Date returns the UTC date and date() doesn't
+  test_df <- tibble(
+    a = as.POSIXct(c("2012-03-26 23:12:13", NA), tz = "America/New_York"))
+  r_date_object <- lubridate::ymd_hms("2012-03-26 23:12:13")
+
+  # we can't (for now) use namespacing, so we need to make sure lubridate::date()
+  # and not base::date() is being used
+  # TODO remove once https://issues.apache.org/jira/browse/ARROW-14575 is done
+  date <- lubridate::date
+  compare_dplyr_binding(
+    .input %>%
+      mutate(a_date = date(a)) %>%
+      collect(),
+    test_df
+  )
+
+  compare_dplyr_binding(
+    .input %>%
+      mutate(a_date_base = as.Date(a)) %>%
+      collect(),
+    test_df
+  )
+
+  compare_dplyr_binding(
+    .input %>%
+      mutate(b = date(r_date_object)) %>%
+      collect(),
+    test_df
+  )
+
+  compare_dplyr_binding(
+    .input %>%
+      mutate(b_base = as.Date(r_date_object)) %>%
+      collect(),
+    test_df
+  )
+})
+
+test_that("date() errors with unsupported inputs", {
+  test_df <- tibble::tibble(
+    posixct_var = as.POSIXct("2022-02-25 00:00:01", tz = "Europe/London"),
+    date_var = as.Date("2022-02-25"),
+    character_ymd_var = "2022-02-25 00:00:01",
+    character_ydm_var = "2022/25/02 00:00:01",
+    integer_var = 32L,
+    double_var = 34.56,
+    boolean_var = TRUE
+  )
+
+  # date from integer supported in arrow (similar to base::as.Date()), but in
+  # Arrow it assumes a fixed origin "1970-01-01"
+  expect_equal(
+    test_df %>%
+      arrow_table() %>%
+      select(integer_var) %>%
+      mutate(date_int = date(integer_var)) %>%
+      collect(),
+    tibble(integer_var = 32L,
+           date_int = as.Date("1970-02-02"))
+  )

Review comment:
       Done. Moved it to the non-error behaviour. That is a case where `lubridate::date()` would error without a value for `origin`. Technically we could decide to error when origin is not present and mimic the lubridate behaviour closer (I think that was my initial intention, hence that test being in the error {testthat} block) 
   ``` r
   lubridate::date(32L)
   #> Error in as.POSIXlt.numeric(x, tz = tz(x)): 'origin' must be supplied
   ```
   
   <sup>Created on 2022-03-01 by the [reprex package](https://reprex.tidyverse.org) (v2.0.1)</sup>




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] dragosmg commented on a change in pull request #12433: ARROW-14808 [R] Implement bindings for `lubridate::date()`

Posted by GitBox <gi...@apache.org>.
dragosmg commented on a change in pull request #12433:
URL: https://github.com/apache/arrow/pull/12433#discussion_r815943524



##########
File path: r/R/dplyr-funcs-type.R
##########
@@ -76,6 +76,53 @@ register_bindings_type_cast <- function() {
   register_binding("as.numeric", function(x) {
     Expression$create("cast", x, options = cast_options(to_type = float64()))
   })
+  register_binding("as.Date", function(x,
+                                       format = NULL,
+                                       origin = "1970-01-01",
+                                       tz = "UTC") {
+    # base::as.Date() first converts to UTC and then extracts the date, which is
+    # why we need to go through timestamp() first - see unit tests for the real
+    # life impact of the difference between lubridate::date() and base::as.Date()
+    # browser()
+    if (call_binding("is.Date", x)) {
+      # arrow_date <- build_expr("cast", x, options = cast_options(to_type = date32()))
+      return(x)
+    } else if (call_binding("is.POSIXct", x)) {
+      if (tz == "UTC") {
+        arrow_timestamp <- build_expr("cast", x, options = cast_options(to_type = timestamp(timezone = tz)))
+        return(build_expr("cast", arrow_timestamp, options = cast_options(to_type = date32())))
+      } else {
+        abort("`as.Date()` with a timezone different to 'UTC' is not supported in Arrow")
+      }
+    } else if (call_binding("is.character", x)) {
+      # this could be improved with tryFormats once strptime returns NA and we
+      # can use coalesce - https://issues.apache.org/jira/browse/ARROW-15659
+      # TODO revisit once https://issues.apache.org/jira/browse/ARROW-15659 is done
+      if (!is.null(format)) {
+        arrow_timestamp <- call_binding("strptime", x, format, unit = "s")
+        return(build_expr("cast", arrow_timestamp, options = cast_options(to_type = date32())))
+      } else {
+        abort("`as.Date()` without `format` is not supported in Arrow")
+      }
+
+    } else if (call_binding("is.numeric", x)) {
+      # the origin argument will be better supported once we implement temporal
+      # arithmetic (https://issues.apache.org/jira/browse/ARROW-14947)
+      # TODO revisit once the above has been sorted
+      if (!call_binding("is.integer", x)) {
+        # Arrow does not support direct casting from double to date so we have
+        # to convert to integers first - casting to int32() would error so we
+        # need to use round before casting
+        x <- call_binding("floor", x)
+        x <- build_expr("cast", x, options = (cast_options(to_type = int32())))
+      }
+      if (origin == "1970-01-01") {
+        return(build_expr("cast", x, options = cast_options(to_type = date32())))
+      } else {
+        abort("`as.Date()` with an `origin` different than '1970-01-01' is not supported in Arrow")

Review comment:
       [ARROW-15798](https://issues.apache.org/jira/browse/ARROW-15798) to discuss supporting a different origin and casting from double to date in Arrow. 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] dragosmg commented on pull request #12433: ARROW-14808 [R] Implement bindings for `lubridate::date()`

Posted by GitBox <gi...@apache.org>.
dragosmg commented on pull request #12433:
URL: https://github.com/apache/arrow/pull/12433#issuecomment-1058236842


   Thanks for your patience and support 😉 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] dragosmg commented on a change in pull request #12433: ARROW-14808 [R] Implement bindings for `lubridate::date()`

Posted by GitBox <gi...@apache.org>.
dragosmg commented on a change in pull request #12433:
URL: https://github.com/apache/arrow/pull/12433#discussion_r817545112



##########
File path: r/R/dplyr-funcs-type.R
##########
@@ -76,6 +76,60 @@ register_bindings_type_cast <- function() {
   register_binding("as.numeric", function(x) {
     Expression$create("cast", x, options = cast_options(to_type = float64()))
   })
+  register_binding("as.Date", function(x,
+                                       format = NULL,
+                                       tryFormats = "%Y-%m-%d",
+                                       origin = "1970-01-01",
+                                       tz = "UTC") {
+
+    if (call_binding("is.Date", x)) {
+      # base::as.Date() first converts to the desired timezone and then extracts
+      # the date, which is why we need to go through timestamp() first
+      return(x)
+
+    # cast from POSIXct
+    } else if (call_binding("is.POSIXct", x)) {
+      if (tz == "UTC") {
+        x <- build_expr("cast", x, options = cast_options(to_type = timestamp(timezone = tz)))
+      } else {
+        abort("`as.Date()` with a timezone different to 'UTC' is not supported in Arrow")
+      }

Review comment:
       I wrote it like I did for 2 reasons:
   * firstly, to have a clear error message for the user
   * secondly, to leave it in (what I thought was) best shape for follow-up work




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] dragosmg commented on pull request #12433: ARROW-14808 [R] Implement bindings for `lubridate::date()`

Posted by GitBox <gi...@apache.org>.
dragosmg commented on pull request #12433:
URL: https://github.com/apache/arrow/pull/12433#issuecomment-1056922140


   Thanks @jonkeane. Would you mind having another look?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] dragosmg commented on a change in pull request #12433: ARROW-14808 [R] Implement bindings for `lubridate::date()`

Posted by GitBox <gi...@apache.org>.
dragosmg commented on a change in pull request #12433:
URL: https://github.com/apache/arrow/pull/12433#discussion_r816583722



##########
File path: r/R/dplyr-funcs-type.R
##########
@@ -76,6 +76,64 @@ register_bindings_type_cast <- function() {
   register_binding("as.numeric", function(x) {
     Expression$create("cast", x, options = cast_options(to_type = float64()))
   })
+  register_binding("as.Date", function(x,
+                                       format = NULL,
+                                       tryFormats = "%Y-%m-%d",
+                                       origin = "1970-01-01",
+                                       tz = "UTC") {
+
+    if (call_binding("is.Date", x)) {
+      # base::as.Date() first converts to the desired timezone and then extracts
+      # the date, which is why we need to go through timestamp() first
+      return(x)
+
+    # cast from POSIXct
+    } else if (call_binding("is.POSIXct", x)) {
+      if (tz == "UTC") {
+        interim_x <- build_expr("cast", x, options = cast_options(to_type = timestamp(timezone = tz)))
+      } else {
+        abort("`as.Date()` with a timezone different to 'UTC' is not supported in Arrow")
+      }
+
+    # cast from character
+    } else if (call_binding("is.character", x)) {
+      # this could be improved with tryFormats once strptime returns NA and we
+      # can use coalesce - https://issues.apache.org/jira/browse/ARROW-15659
+      # TODO revisit once https://issues.apache.org/jira/browse/ARROW-15659 is done
+      if (is.null(format)) {
+        if (length(tryFormats) == 1) {
+          format <- tryFormats[1]
+        } else {
+          abort("`as.Date()` with multiple `tryFormats` is not supported in Arrow yet")
+        }
+      }
+      # if x is not an expression (e.g. passed as filter), convert it to one
+      if (!inherits(x, "Expression")) {
+        x <- build_expr("cast", x, options = cast_options(to_type = type(x)))
+      }
+      interim_x <- call_binding("strptime", x, format, unit = "s")
+
+    # cast from numeric
+    } else if (call_binding("is.numeric", x)) {
+      # the origin argument will be better supported once we implement temporal
+      # arithmetic (https://issues.apache.org/jira/browse/ARROW-14947)
+      # TODO revisit once the above has been sorted
+      if (!call_binding("is.integer", x)) {
+        # Arrow does not support direct casting from double to date so we have
+        # to convert to integers first - casting to int32() would error so we
+        # need to use round before casting
+        # TODO revisit if arrow decides to support double -> date casting
+        x <- call_binding("floor", x)
+        interim_x <- build_expr("cast", x, options = (cast_options(to_type = int32())))
+      } else {
+        interim_x <- x
+      }

Review comment:
       Done. 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] dragosmg commented on a change in pull request #12433: ARROW-14808 [R] Implement bindings for `lubridate::date()`

Posted by GitBox <gi...@apache.org>.
dragosmg commented on a change in pull request #12433:
URL: https://github.com/apache/arrow/pull/12433#discussion_r816585997



##########
File path: r/tests/testthat/test-dplyr-funcs-datetime.R
##########
@@ -801,3 +801,94 @@ test_that("dst extracts daylight savings time correctly", {
     test_df
   )
 })
+
+test_that("date works in arrow", {
+  # https://issues.apache.org/jira/browse/ARROW-13168
+  skip_on_os("windows")
+  # this date is specific since lubridate::date() is different from base::as.Date()
+  # since as.Date returns the UTC date and date() doesn't
+  test_df <- tibble(
+    a = as.POSIXct(c("2012-03-26 23:12:13", NA), tz = "America/New_York"))
+  r_date_object <- lubridate::ymd_hms("2012-03-26 23:12:13")
+
+  # we can't (for now) use namespacing, so we need to make sure lubridate::date()
+  # and not base::date() is being used
+  # TODO remove once https://issues.apache.org/jira/browse/ARROW-14575 is done

Review comment:
       Done (added more details in the comment). 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] dragosmg commented on a change in pull request #12433: ARROW-14808 [R] Implement bindings for `lubridate::date()`

Posted by GitBox <gi...@apache.org>.
dragosmg commented on a change in pull request #12433:
URL: https://github.com/apache/arrow/pull/12433#discussion_r816625715



##########
File path: r/R/dplyr-funcs-type.R
##########
@@ -76,6 +76,64 @@ register_bindings_type_cast <- function() {
   register_binding("as.numeric", function(x) {
     Expression$create("cast", x, options = cast_options(to_type = float64()))
   })
+  register_binding("as.Date", function(x,
+                                       format = NULL,
+                                       tryFormats = "%Y-%m-%d",
+                                       origin = "1970-01-01",
+                                       tz = "UTC") {
+
+    if (call_binding("is.Date", x)) {
+      # base::as.Date() first converts to the desired timezone and then extracts
+      # the date, which is why we need to go through timestamp() first
+      return(x)
+
+    # cast from POSIXct
+    } else if (call_binding("is.POSIXct", x)) {
+      if (tz == "UTC") {
+        interim_x <- build_expr("cast", x, options = cast_options(to_type = timestamp(timezone = tz)))
+      } else {
+        abort("`as.Date()` with a timezone different to 'UTC' is not supported in Arrow")
+      }
+
+    # cast from character
+    } else if (call_binding("is.character", x)) {
+      # this could be improved with tryFormats once strptime returns NA and we
+      # can use coalesce - https://issues.apache.org/jira/browse/ARROW-15659
+      # TODO revisit once https://issues.apache.org/jira/browse/ARROW-15659 is done
+      if (is.null(format)) {
+        if (length(tryFormats) == 1) {
+          format <- tryFormats[1]
+        } else {
+          abort("`as.Date()` with multiple `tryFormats` is not supported in Arrow yet")
+        }
+      }
+      # if x is not an expression (e.g. passed as filter), convert it to one
+      if (!inherits(x, "Expression")) {
+        x <- build_expr("cast", x, options = cast_options(to_type = type(x)))
+      }

Review comment:
       I introduced that step (converting a character `x` to `Expression`) as a way of dealing with the errors we were seeing with `filter()`. In `filter.arrow_dplyr_query()` we get the following error after tidy-eval-ing the filter expression:
   ```r
   > filters <- lapply(filts, arrow_eval, arrow_mask(.data))
   > filters
   [[1]]
   [1] "Expression arguments must be Expression objects"
   attr(,"class")
   [1] "try-error"
   attr(,"condition")
   <assertError: Expression arguments must be Expression objects>
   ``` 
   
   where `filts` is:
   ```r
   > filts
   <list_of<quosure>>
   
   [[1]]
   <quosure>
   expr: ^ts >= as.Date("2015-05-04")
   env:  0x297eaa768
   ``` 
   
   and the original test is:
   ```r
   # ds is an open dataset
   ds %>%
         filter(ts >= as.Date("2015-05-04")) %>%
         filter(part == 1) %>%
         select(ts) %>%
         collect()
   ```
   
   As far as I understand it, it fails due to `"2015-05-04"` being a character and not an `Expression` object. Inside the `filter. ...` method:
   ```r
   > b <- arrow_eval(filts[[1]], arrow_mask(.data))
   > b
   [1] "Expression arguments must be Expression objects"
   attr(,"class")
   [1] "try-error"
   attr(,"condition")
   <assertError: Expression arguments must be Expression objects>
   ```
   Hence my approach: if we somehow get to that point in the function body and, for some reason, character `x` is not yet an `Expression`, make it into one. 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jonkeane commented on a change in pull request #12433: ARROW-14808 [R] Implement bindings for `lubridate::date()`

Posted by GitBox <gi...@apache.org>.
jonkeane commented on a change in pull request #12433:
URL: https://github.com/apache/arrow/pull/12433#discussion_r818756056



##########
File path: r/R/dplyr-funcs-type.R
##########
@@ -76,6 +76,48 @@ register_bindings_type_cast <- function() {
   register_binding("as.numeric", function(x) {
     Expression$create("cast", x, options = cast_options(to_type = float64()))
   })
+  register_binding("as.Date", function(x,
+                                       format = NULL,
+                                       tryFormats = "%Y-%m-%d",
+                                       origin = "1970-01-01",
+                                       tz = "UTC") {
+
+    # the origin argument will be better supported once we implement temporal
+    # arithmetic (https://issues.apache.org/jira/browse/ARROW-14947)
+    # TODO revisit once the above has been sorted
+    if (call_binding("is.numeric", x) & origin != "1970-01-01") {
+      abort("`as.Date()` with an `origin` different than '1970-01-01' is not supported in Arrow")
+    }

Review comment:
       This is a great approach, thanks for this!

##########
File path: r/tests/testthat/test-dplyr-funcs-type.R
##########
@@ -767,4 +768,81 @@ test_that("nested structs can be created from scalars and existing data frames",
       collect(),
     tibble(a = 1:2)
   )
+
+  })
+
+test_that("as.Date() converts successfully from date, timestamp, integer, char and double", {
+  test_df <- tibble::tibble(
+    posixct_var = as.POSIXct("2022-02-25 00:00:01", tz = "Europe/London"),
+    date_var = as.Date("2022-02-25"),
+    character_ymd_var = "2022-02-25 00:00:01",
+    character_ydm_var = "2022/25/02 00:00:01",
+    integer_var = 32L,
+    double_var = 34.56
+  )
+
+  # casting from POSIXct treated separately so we can skip on Windows
+  # TODO move the test for casting from POSIXct below once ARROW-13168 is done
+  compare_dplyr_binding(
+    .input %>%
+      mutate(
+        date_dv = as.Date(date_var),
+        date_char_ymd = as.Date(character_ymd_var, format = "%Y-%m-%d %H:%M:%S"),
+        date_char_ydm = as.Date(character_ydm_var, format = "%Y/%d/%m %H:%M:%S"),
+        date_int = as.Date(integer_var, origin = "1970-01-01")
+      ) %>%
+      collect(),
+    test_df
+  )
+
+  # currently we do not support an origin different to "1970-01-01"
+  expect_warning(
+    test_df %>%
+      arrow_table() %>%
+      mutate(date_int = as.Date(integer_var, origin = "1970-01-03")) %>%
+      collect(),
+    regexp = "`as.Date()` with an `origin` different than '1970-01-01' is not supported in Arrow",
+    fixed = TRUE
+  )

Review comment:
       Could this (and the few other `expect_warnings()` below) use the form:
   
   ```
     compare_dplyr_binding(
        .input %>%
           arrow_table() %>%
           mutate(date_int = as.Date(integer_var, origin = "1970-01-03")) %>%
           collect(),
         test_df,
         warning = TRUE
       )
   ```
   
   Which we use elsewhere to show that this is a "not support in Arrow, pulling into R". If you wanted to put the specific error in a comment above, that would also be fine to show _why_ it's being pulled into R for folks who are reading the tests.

##########
File path: r/R/dplyr-funcs-type.R
##########
@@ -76,6 +76,48 @@ register_bindings_type_cast <- function() {
   register_binding("as.numeric", function(x) {
     Expression$create("cast", x, options = cast_options(to_type = float64()))
   })
+  register_binding("as.Date", function(x,
+                                       format = NULL,
+                                       tryFormats = "%Y-%m-%d",
+                                       origin = "1970-01-01",
+                                       tz = "UTC") {
+
+    # the origin argument will be better supported once we implement temporal
+    # arithmetic (https://issues.apache.org/jira/browse/ARROW-14947)
+    # TODO revisit once the above has been sorted
+    if (call_binding("is.numeric", x) & origin != "1970-01-01") {
+      abort("`as.Date()` with an `origin` different than '1970-01-01' is not supported in Arrow")
+    }
+
+    # this could be improved with tryFormats once strptime returns NA and we
+    # can use coalesce - https://issues.apache.org/jira/browse/ARROW-15659
+    # TODO revisit once https://issues.apache.org/jira/browse/ARROW-15659 is done
+    if (is.null(format) && length(tryFormats) > 1) {
+      abort("`as.Date()` with multiple `tryFormats` is not supported in Arrow")
+    }
+
+    if (call_binding("is.Date", x)) {
+      return(x)
+
+    # cast from POSIXct
+    } else if (call_binding("is.POSIXct", x)) {
+      # base::as.Date() first converts to the desired timezone and then extracts
+      # the date, which is why we need to go through timestamp() first
+      x <- build_expr("cast", x, options = cast_options(to_type = timestamp(timezone = tz)))

Review comment:
       Very nice comment

##########
File path: r/R/dplyr-funcs-type.R
##########
@@ -76,6 +76,48 @@ register_bindings_type_cast <- function() {
   register_binding("as.numeric", function(x) {
     Expression$create("cast", x, options = cast_options(to_type = float64()))
   })
+  register_binding("as.Date", function(x,
+                                       format = NULL,
+                                       tryFormats = "%Y-%m-%d",
+                                       origin = "1970-01-01",
+                                       tz = "UTC") {
+
+    # the origin argument will be better supported once we implement temporal
+    # arithmetic (https://issues.apache.org/jira/browse/ARROW-14947)
+    # TODO revisit once the above has been sorted
+    if (call_binding("is.numeric", x) & origin != "1970-01-01") {
+      abort("`as.Date()` with an `origin` different than '1970-01-01' is not supported in Arrow")
+    }
+
+    # this could be improved with tryFormats once strptime returns NA and we
+    # can use coalesce - https://issues.apache.org/jira/browse/ARROW-15659
+    # TODO revisit once https://issues.apache.org/jira/browse/ARROW-15659 is done
+    if (is.null(format) && length(tryFormats) > 1) {
+      abort("`as.Date()` with multiple `tryFormats` is not supported in Arrow")
+    }
+
+    if (call_binding("is.Date", x)) {
+      return(x)
+
+    # cast from POSIXct
+    } else if (call_binding("is.POSIXct", x)) {
+      # base::as.Date() first converts to the desired timezone and then extracts
+      # the date, which is why we need to go through timestamp() first
+      x <- build_expr("cast", x, options = cast_options(to_type = timestamp(timezone = tz)))
+
+    # cast from character
+    } else if (call_binding("is.character", x)) {
+      format <- format %||% tryFormats[[1]]
+      x <- build_expr("strptime", x, options = list(format = format, unit = 0L))

Review comment:
       It might be nice to have a comment above here that explains what `unit = 0L` means in human terms just so that someone doesn't have to go looking for it. 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] dragosmg commented on a change in pull request #12433: ARROW-14808 [R] Implement bindings for `lubridate::date()`

Posted by GitBox <gi...@apache.org>.
dragosmg commented on a change in pull request #12433:
URL: https://github.com/apache/arrow/pull/12433#discussion_r811045008



##########
File path: r/R/dplyr-funcs-datetime.R
##########
@@ -147,5 +147,7 @@ register_bindings_datetime <- function() {
   register_binding("pm", function(x) {
     !call_binding("am", x)
   })
-
+  register_binding("date", function(x) {
+    x$cast(date32())
+  })

Review comment:
       Q2: Not sure what we want to achieve by calling a binding (in this case `"date"`) on an R object. Do we want to support this (should all bindings support R objects as inputs?) or do we just want to make sure the error message is meaningful? 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jonkeane commented on a change in pull request #12433: ARROW-14808 [R] Implement bindings for `lubridate::date()`

Posted by GitBox <gi...@apache.org>.
jonkeane commented on a change in pull request #12433:
URL: https://github.com/apache/arrow/pull/12433#discussion_r817205283



##########
File path: r/R/dplyr-funcs-type.R
##########
@@ -76,6 +76,60 @@ register_bindings_type_cast <- function() {
   register_binding("as.numeric", function(x) {
     Expression$create("cast", x, options = cast_options(to_type = float64()))
   })
+  register_binding("as.Date", function(x,
+                                       format = NULL,
+                                       tryFormats = "%Y-%m-%d",
+                                       origin = "1970-01-01",
+                                       tz = "UTC") {
+
+    if (call_binding("is.Date", x)) {
+      # base::as.Date() first converts to the desired timezone and then extracts
+      # the date, which is why we need to go through timestamp() first
+      return(x)
+
+    # cast from POSIXct
+    } else if (call_binding("is.POSIXct", x)) {
+      if (tz == "UTC") {
+        x <- build_expr("cast", x, options = cast_options(to_type = timestamp(timezone = tz)))
+      } else {
+        abort("`as.Date()` with a timezone different to 'UTC' is not supported in Arrow")
+      }

Review comment:
       We chatted about this a bit earlier, have you tried to get rid of this bit here? I suspect that you're actually close but this should totally be possible. 
   
   You might need to have `tz` here be hardcoded as `UTC`, eg:
   
   ```
   x <- build_expr("cast", x, options = cast_options(to_type = timestamp(timezone ="UTC")))
   ```
   
   But I _think_ that should do it and just work.
   
   Here's a test I did with arrow from the master branch:
   ```
   > df <- data.frame(time = as.POSIXct("2020-01-01 22:00:00", tz = "America/Chicago"))
   > df %>% arrow_table() %>% mutate(strftime = strftime(time, "%Y-%m-%d"), as_date = cast(time, date32()), as_date_utc = cast(cast(time, timestamp(timezone = "UTC")), date32())) %>% collect()
                    time   strftime    as_date as_date_utc
   1 2020-01-01 22:00:00 2020-01-01 2020-01-01  2020-01-02
   ```

##########
File path: r/R/dplyr-funcs-type.R
##########
@@ -76,6 +76,60 @@ register_bindings_type_cast <- function() {
   register_binding("as.numeric", function(x) {
     Expression$create("cast", x, options = cast_options(to_type = float64()))
   })
+  register_binding("as.Date", function(x,
+                                       format = NULL,
+                                       tryFormats = "%Y-%m-%d",
+                                       origin = "1970-01-01",
+                                       tz = "UTC") {
+
+    if (call_binding("is.Date", x)) {
+      # base::as.Date() first converts to the desired timezone and then extracts
+      # the date, which is why we need to go through timestamp() first
+      return(x)
+
+    # cast from POSIXct
+    } else if (call_binding("is.POSIXct", x)) {
+      if (tz == "UTC") {
+        x <- build_expr("cast", x, options = cast_options(to_type = timestamp(timezone = tz)))
+      } else {
+        abort("`as.Date()` with a timezone different to 'UTC' is not supported in Arrow")
+      }
+
+    # cast from character
+    } else if (call_binding("is.character", x)) {
+      # this could be improved with tryFormats once strptime returns NA and we
+      # can use coalesce - https://issues.apache.org/jira/browse/ARROW-15659
+      # TODO revisit once https://issues.apache.org/jira/browse/ARROW-15659 is done
+      if (is.null(format)) {
+        if (length(tryFormats) == 1) {
+          format <- tryFormats[1]
+        } else {
+          abort("`as.Date()` with multiple `tryFormats` is not supported in Arrow yet")
+        }
+      }
+      x <- build_expr("strptime", x, options = list(format = format, unit = 0L))

Review comment:
       What you have is a bit more exhaustive (and the hard coded `[[1]]` in the call isn't typical, but this should have the same behavior and saves us one level of if/else nesting (which is not usually a big deal, but in this function we've got a bunch already!).
   
   Alternatively, you could make the default `format` _be_ `tryFormats[[1]]` in the signature. That would accomplish the same thing and you could have just the `if() { abort() }` and then `build_expr(... format = format)`
   
   ```suggestion
         if (is.null(format) && length(tryFormats) == 1) {
           abort("`as.Date()` with multiple `tryFormats` is not supported in Arrow yet")
         }
         
         format <- format %||% tryFormats[[1]]
         x <- build_expr("strptime", x, options = list(format = format, unit = 0L))
   ```

##########
File path: r/R/dplyr-funcs-type.R
##########
@@ -76,6 +76,60 @@ register_bindings_type_cast <- function() {
   register_binding("as.numeric", function(x) {
     Expression$create("cast", x, options = cast_options(to_type = float64()))
   })
+  register_binding("as.Date", function(x,
+                                       format = NULL,
+                                       tryFormats = "%Y-%m-%d",
+                                       origin = "1970-01-01",
+                                       tz = "UTC") {
+
+    if (call_binding("is.Date", x)) {
+      # base::as.Date() first converts to the desired timezone and then extracts
+      # the date, which is why we need to go through timestamp() first
+      return(x)
+
+    # cast from POSIXct
+    } else if (call_binding("is.POSIXct", x)) {
+      if (tz == "UTC") {
+        x <- build_expr("cast", x, options = cast_options(to_type = timestamp(timezone = tz)))
+      } else {
+        abort("`as.Date()` with a timezone different to 'UTC' is not supported in Arrow")
+      }
+
+    # cast from character
+    } else if (call_binding("is.character", x)) {
+      # this could be improved with tryFormats once strptime returns NA and we
+      # can use coalesce - https://issues.apache.org/jira/browse/ARROW-15659
+      # TODO revisit once https://issues.apache.org/jira/browse/ARROW-15659 is done
+      if (is.null(format)) {
+        if (length(tryFormats) == 1) {
+          format <- tryFormats[1]
+        } else {
+          abort("`as.Date()` with multiple `tryFormats` is not supported in Arrow yet")
+        }
+      }
+      x <- build_expr("strptime", x, options = list(format = format, unit = 0L))
+
+    # cast from numeric
+    } else if (call_binding("is.numeric", x)) {
+      # the origin argument will be better supported once we implement temporal
+      # arithmetic (https://issues.apache.org/jira/browse/ARROW-14947)
+      # TODO revisit once the above has been sorted
+      if (!call_binding("is.integer", x)) {
+        # Arrow does not support direct casting from double to date so we have
+        # to convert to integers first - casting to int32() would error so we
+        # need to use `floor` before casting. `floor` is also a bit safer than
+        # int32() with `safe = FALSE` since it doesn't switch to `ceiling` for
+        # negative numbers
+        # TODO revisit if arrow decides to support double -> date casting
+        x <- call_binding("floor", x)
+        x <- build_expr("cast", x, options = cast_options(to_type = int32()))
+      }
+      if (origin != "1970-01-01") {
+        abort("`as.Date()` with an `origin` different than '1970-01-01' is not supported in Arrow")
+      }
+    }

Review comment:
       Minor, but could we move this to the top of the function? There's not much computation going on above so I'm not super worried about the time it takes to get to the error, but it's helpful to have argument validation at the top of a function

##########
File path: r/tests/testthat/test-dplyr-funcs-datetime.R
##########
@@ -801,3 +801,99 @@ test_that("dst extracts daylight savings time correctly", {
     test_df
   )
 })
+
+test_that("date works in arrow", {
+  # https://issues.apache.org/jira/browse/ARROW-13168
+  skip_on_os("windows")
+  # this date is specific since lubridate::date() is different from base::as.Date()
+  # since as.Date returns the UTC date and date() doesn't
+  test_df <- tibble(
+    posixct_date = as.POSIXct(c("2012-03-26 23:12:13", NA), tz = "America/New_York"),
+    integer_var = c(32L, NA))
+
+  r_date_object <- lubridate::ymd_hms("2012-03-26 23:12:13")
+
+  # we can't (for now) use namespacing, so we need to make sure lubridate::date()
+  # and not base::date() is being used. This is due to the way testthat runs and
+  # normal use of arrow would not have to do this explicitly.
+  # TODO remove once https://issues.apache.org/jira/browse/ARROW-14575 is done
+  date <- lubridate::date
+
+  compare_dplyr_binding(
+    .input %>%
+      mutate(a_date = date(posixct_date)) %>%
+      collect(),
+    test_df
+  )
+
+  compare_dplyr_binding(
+    .input %>%
+      mutate(a_date_base = as.Date(posixct_date)) %>%
+      collect(),
+    test_df
+  )
+
+  compare_dplyr_binding(
+    .input %>%
+      mutate(date_from_r_object = date(r_date_object)) %>%
+      collect(),
+    test_df
+  )
+
+  compare_dplyr_binding(
+    .input %>%
+      mutate(as_date_from_r_object = as.Date(r_date_object)) %>%
+      collect(),
+    test_df
+  )
+
+  # date from integer supported in arrow (similar to base::as.Date()), but in
+  # Arrow it assumes a fixed origin "1970-01-01". However this is not supported
+  # by lubridate. lubridate::date(integer_var) errors without an `origin`
+  expect_equal(
+    test_df %>%
+      arrow_table() %>%
+      select(integer_var) %>%
+      mutate(date_int = date(integer_var)) %>%
+      collect(),
+    tibble(integer_var = c(32L, NA),
+           date_int = as.Date(c("1970-02-02", NA)))
+  )
+})
+
+test_that("date() errors with unsupported inputs", {
+  test_df <- tibble::tibble(
+    posixct_var = as.POSIXct("2022-02-25 00:00:01", tz = "Europe/London"),
+    date_var = as.Date("2022-02-25"),
+    character_ymd_var = "2022-02-25 00:00:01",
+    character_ydm_var = "2022/25/02 00:00:01",
+    integer_var = 32L,
+    double_var = 34.56,
+    boolean_var = TRUE
+  )
+
+  expect_error(
+    test_df %>%
+      arrow_table() %>%
+      mutate(date_char = date(character_ymd_var)) %>%
+      collect(),
+    regexp = "Unsupported cast from string to date32 using function cast_date32"
+  )
+
+  expect_error(
+    test_df %>%
+      arrow_table() %>%
+      mutate(date_bool = date(boolean_var)) %>%
+      collect(),
+    regexp = "Unsupported cast from bool to date32 using function cast_date32"
+  )
+
+  expect_error(
+    test_df %>%
+      arrow_table() %>%
+      mutate(date_double = date(double_var)) %>%
+      collect(),
+    regexp = "Unsupported cast from double to date32 using function cast_date32"
+  )

Review comment:
       Should we remove columns from this data.frame that we're not using in the tests?
   
   Also is it possible to use R-scalar values in the pipelines themselves:
   
   ```
     expect_error(
         test_df %>%
           arrow_table() %>%
           mutate(date_char = date("2022-02-25 00:00:01")) %>%
           collect(),
         regexp = "Unsupported cast from string to date32 using function cast_date32"
       )
   ```
   
   That would be a bit cleaner + easier to parse IMO

##########
File path: r/tests/testthat/test-dplyr-funcs-datetime.R
##########
@@ -801,3 +801,99 @@ test_that("dst extracts daylight savings time correctly", {
     test_df
   )
 })
+
+test_that("date works in arrow", {
+  # https://issues.apache.org/jira/browse/ARROW-13168
+  skip_on_os("windows")
+  # this date is specific since lubridate::date() is different from base::as.Date()
+  # since as.Date returns the UTC date and date() doesn't
+  test_df <- tibble(
+    posixct_date = as.POSIXct(c("2012-03-26 23:12:13", NA), tz = "America/New_York"),
+    integer_var = c(32L, NA))
+
+  r_date_object <- lubridate::ymd_hms("2012-03-26 23:12:13")
+
+  # we can't (for now) use namespacing, so we need to make sure lubridate::date()
+  # and not base::date() is being used. This is due to the way testthat runs and
+  # normal use of arrow would not have to do this explicitly.
+  # TODO remove once https://issues.apache.org/jira/browse/ARROW-14575 is done
+  date <- lubridate::date
+
+  compare_dplyr_binding(
+    .input %>%
+      mutate(a_date = date(posixct_date)) %>%
+      collect(),
+    test_df
+  )
+
+  compare_dplyr_binding(
+    .input %>%
+      mutate(a_date_base = as.Date(posixct_date)) %>%
+      collect(),
+    test_df
+  )
+
+  compare_dplyr_binding(
+    .input %>%
+      mutate(date_from_r_object = date(r_date_object)) %>%
+      collect(),
+    test_df
+  )
+
+  compare_dplyr_binding(
+    .input %>%
+      mutate(as_date_from_r_object = as.Date(r_date_object)) %>%
+      collect(),
+    test_df
+  )
+
+  # date from integer supported in arrow (similar to base::as.Date()), but in
+  # Arrow it assumes a fixed origin "1970-01-01". However this is not supported
+  # by lubridate. lubridate::date(integer_var) errors without an `origin`
+  expect_equal(
+    test_df %>%
+      arrow_table() %>%
+      select(integer_var) %>%
+      mutate(date_int = date(integer_var)) %>%
+      collect(),
+    tibble(integer_var = c(32L, NA),
+           date_int = as.Date(c("1970-02-02", NA)))
+  )
+})
+
+test_that("date() errors with unsupported inputs", {
+  test_df <- tibble::tibble(
+    posixct_var = as.POSIXct("2022-02-25 00:00:01", tz = "Europe/London"),
+    date_var = as.Date("2022-02-25"),
+    character_ymd_var = "2022-02-25 00:00:01",
+    character_ydm_var = "2022/25/02 00:00:01",
+    integer_var = 32L,
+    double_var = 34.56,
+    boolean_var = TRUE
+  )
+
+  expect_error(
+    test_df %>%
+      arrow_table() %>%
+      mutate(date_char = date(character_ymd_var)) %>%
+      collect(),
+    regexp = "Unsupported cast from string to date32 using function cast_date32"
+  )
+
+  expect_error(
+    test_df %>%
+      arrow_table() %>%
+      mutate(date_bool = date(boolean_var)) %>%
+      collect(),
+    regexp = "Unsupported cast from bool to date32 using function cast_date32"
+  )
+
+  expect_error(
+    test_df %>%
+      arrow_table() %>%
+      mutate(date_double = date(double_var)) %>%
+      collect(),
+    regexp = "Unsupported cast from double to date32 using function cast_date32"
+  )
+

Review comment:
       ```suggestion
   ```

##########
File path: r/tests/testthat/test-dplyr-funcs-type.R
##########
@@ -768,3 +769,138 @@ test_that("nested structs can be created from scalars and existing data frames",
     tibble(a = 1:2)
   )
 })
+
+test_that("as.Date() converts successfully from date, timestamp, integer, char and double", {
+  test_df <- tibble::tibble(
+    posixct_var = as.POSIXct("2022-02-25 00:00:01", tz = "Europe/London"),
+    date_var = as.Date("2022-02-25"),
+    character_ymd_var = "2022-02-25 00:00:01",
+    character_ydm_var = "2022/25/02 00:00:01",
+    integer_var = 32L,
+    double_var = 34.56
+  )
+
+  # casting from POSIXct treated separately so we can skip on Windows
+  # TODO move the test for casting from POSIXct below once ARROW-13168 is done
+  compare_dplyr_binding(
+    .input %>%
+      mutate(
+        date_dv = as.Date(date_var),
+        date_char_ymd = as.Date(character_ymd_var, format = "%Y-%m-%d %H:%M:%S"),
+        date_char_ydm = as.Date(character_ydm_var, format = "%Y/%d/%m %H:%M:%S"),
+        date_int = as.Date(integer_var, origin = "1970-01-01")
+      ) %>%
+      collect(),
+    test_df
+  )
+
+  # the way we go about it is a bit different, but the result is the same =>
+  # testing without compare_dplyr_binding()
+  expect_equal(
+    test_df %>%
+      arrow_table() %>%
+      mutate(date_double = as.Date(double_var)) %>%
+      collect(),
+    test_df %>%
+      arrow_table() %>%
+      mutate(date_double = as.Date(double_var, origin = "1970-01-01")) %>%
+      collect()
+  )

Review comment:
       And same for the test below too

##########
File path: r/tests/testthat/test-dplyr-funcs-type.R
##########
@@ -768,3 +769,138 @@ test_that("nested structs can be created from scalars and existing data frames",
     tibble(a = 1:2)
   )
 })
+
+test_that("as.Date() converts successfully from date, timestamp, integer, char and double", {
+  test_df <- tibble::tibble(
+    posixct_var = as.POSIXct("2022-02-25 00:00:01", tz = "Europe/London"),
+    date_var = as.Date("2022-02-25"),
+    character_ymd_var = "2022-02-25 00:00:01",
+    character_ydm_var = "2022/25/02 00:00:01",
+    integer_var = 32L,
+    double_var = 34.56
+  )
+
+  # casting from POSIXct treated separately so we can skip on Windows
+  # TODO move the test for casting from POSIXct below once ARROW-13168 is done
+  compare_dplyr_binding(
+    .input %>%
+      mutate(
+        date_dv = as.Date(date_var),
+        date_char_ymd = as.Date(character_ymd_var, format = "%Y-%m-%d %H:%M:%S"),
+        date_char_ydm = as.Date(character_ydm_var, format = "%Y/%d/%m %H:%M:%S"),
+        date_int = as.Date(integer_var, origin = "1970-01-01")
+      ) %>%
+      collect(),
+    test_df
+  )
+
+  # the way we go about it is a bit different, but the result is the same =>
+  # testing without compare_dplyr_binding()
+  expect_equal(
+    test_df %>%
+      arrow_table() %>%
+      mutate(date_double = as.Date(double_var)) %>%
+      collect(),
+    test_df %>%
+      arrow_table() %>%
+      mutate(date_double = as.Date(double_var, origin = "1970-01-01")) %>%
+      collect()
+  )
+
+  expect_equal(
+    test_df %>%
+      record_batch() %>%
+      mutate(date_double = as.Date(double_var)) %>%
+      collect(),
+    test_df %>%
+      arrow_table() %>%
+      mutate(date_double = as.Date(double_var, origin = "1970-01-01")) %>%
+      collect()
+  )
+
+  # actual and expected differ due to doubles are accounted for (floored in
+  # arrow and rounded to the next decimal in R)
+  expect_error(
+    compare_dplyr_binding(
+      .input %>%
+        mutate(date_double = as.Date(double_var, origin = "1970-01-01")) %>%
+        collect(),
+      test_df
+    )
+  )

Review comment:
       Is this what we want? When I was reading the `floor` stuff in the code I actually assumed that we were doing that because that's what R did. But seeing this test now, it sounds like we probably ought to `round` instead of `floor`, yeah? 
   
   Or is there some other reason why `floor` is what we want to do in R?

##########
File path: r/tests/testthat/test-dplyr-funcs-type.R
##########
@@ -768,3 +769,138 @@ test_that("nested structs can be created from scalars and existing data frames",
     tibble(a = 1:2)
   )
 })
+
+test_that("as.Date() converts successfully from date, timestamp, integer, char and double", {
+  test_df <- tibble::tibble(
+    posixct_var = as.POSIXct("2022-02-25 00:00:01", tz = "Europe/London"),
+    date_var = as.Date("2022-02-25"),
+    character_ymd_var = "2022-02-25 00:00:01",
+    character_ydm_var = "2022/25/02 00:00:01",
+    integer_var = 32L,
+    double_var = 34.56
+  )
+
+  # casting from POSIXct treated separately so we can skip on Windows
+  # TODO move the test for casting from POSIXct below once ARROW-13168 is done
+  compare_dplyr_binding(
+    .input %>%
+      mutate(
+        date_dv = as.Date(date_var),
+        date_char_ymd = as.Date(character_ymd_var, format = "%Y-%m-%d %H:%M:%S"),
+        date_char_ydm = as.Date(character_ydm_var, format = "%Y/%d/%m %H:%M:%S"),
+        date_int = as.Date(integer_var, origin = "1970-01-01")
+      ) %>%
+      collect(),
+    test_df
+  )
+
+  # the way we go about it is a bit different, but the result is the same =>
+  # testing without compare_dplyr_binding()
+  expect_equal(
+    test_df %>%
+      arrow_table() %>%
+      mutate(date_double = as.Date(double_var)) %>%
+      collect(),
+    test_df %>%
+      arrow_table() %>%
+      mutate(date_double = as.Date(double_var, origin = "1970-01-01")) %>%
+      collect()
+  )

Review comment:
       Could you tell me more about this test and what it's testing? What I'm seeing is basically testing that the default is `origin = "1970-01-01"`, but maybe I'm missing something?

##########
File path: r/R/dplyr-funcs-type.R
##########
@@ -76,6 +76,60 @@ register_bindings_type_cast <- function() {
   register_binding("as.numeric", function(x) {
     Expression$create("cast", x, options = cast_options(to_type = float64()))
   })
+  register_binding("as.Date", function(x,
+                                       format = NULL,
+                                       tryFormats = "%Y-%m-%d",
+                                       origin = "1970-01-01",
+                                       tz = "UTC") {
+
+    if (call_binding("is.Date", x)) {
+      # base::as.Date() first converts to the desired timezone and then extracts
+      # the date, which is why we need to go through timestamp() first
+      return(x)
+
+    # cast from POSIXct
+    } else if (call_binding("is.POSIXct", x)) {
+      if (tz == "UTC") {
+        x <- build_expr("cast", x, options = cast_options(to_type = timestamp(timezone = tz)))
+      } else {
+        abort("`as.Date()` with a timezone different to 'UTC' is not supported in Arrow")
+      }
+
+    # cast from character
+    } else if (call_binding("is.character", x)) {
+      # this could be improved with tryFormats once strptime returns NA and we
+      # can use coalesce - https://issues.apache.org/jira/browse/ARROW-15659
+      # TODO revisit once https://issues.apache.org/jira/browse/ARROW-15659 is done
+      if (is.null(format)) {
+        if (length(tryFormats) == 1) {
+          format <- tryFormats[1]
+        } else {
+          abort("`as.Date()` with multiple `tryFormats` is not supported in Arrow yet")
+        }
+      }
+      x <- build_expr("strptime", x, options = list(format = format, unit = 0L))
+
+    # cast from numeric
+    } else if (call_binding("is.numeric", x)) {
+      # the origin argument will be better supported once we implement temporal
+      # arithmetic (https://issues.apache.org/jira/browse/ARROW-14947)
+      # TODO revisit once the above has been sorted
+      if (!call_binding("is.integer", x)) {

Review comment:
       Could we collapse these into this?
   
   ```
   } else if (call_binding("is.numeric", x) & !call_binding("is.integer", x)) {
   ```
   
   I know eventually we might need to split them out, but again this will give us one less indentation level

##########
File path: r/tests/testthat/test-dplyr-funcs-type.R
##########
@@ -768,3 +769,138 @@ test_that("nested structs can be created from scalars and existing data frames",
     tibble(a = 1:2)
   )
 })
+
+test_that("as.Date() converts successfully from date, timestamp, integer, char and double", {
+  test_df <- tibble::tibble(
+    posixct_var = as.POSIXct("2022-02-25 00:00:01", tz = "Europe/London"),
+    date_var = as.Date("2022-02-25"),
+    character_ymd_var = "2022-02-25 00:00:01",
+    character_ydm_var = "2022/25/02 00:00:01",
+    integer_var = 32L,
+    double_var = 34.56
+  )
+
+  # casting from POSIXct treated separately so we can skip on Windows
+  # TODO move the test for casting from POSIXct below once ARROW-13168 is done
+  compare_dplyr_binding(
+    .input %>%
+      mutate(
+        date_dv = as.Date(date_var),
+        date_char_ymd = as.Date(character_ymd_var, format = "%Y-%m-%d %H:%M:%S"),
+        date_char_ydm = as.Date(character_ydm_var, format = "%Y/%d/%m %H:%M:%S"),
+        date_int = as.Date(integer_var, origin = "1970-01-01")
+      ) %>%
+      collect(),
+    test_df
+  )
+
+  # the way we go about it is a bit different, but the result is the same =>
+  # testing without compare_dplyr_binding()
+  expect_equal(
+    test_df %>%
+      arrow_table() %>%
+      mutate(date_double = as.Date(double_var)) %>%
+      collect(),
+    test_df %>%
+      arrow_table() %>%
+      mutate(date_double = as.Date(double_var, origin = "1970-01-01")) %>%
+      collect()
+  )
+
+  expect_equal(
+    test_df %>%
+      record_batch() %>%
+      mutate(date_double = as.Date(double_var)) %>%
+      collect(),
+    test_df %>%
+      arrow_table() %>%
+      mutate(date_double = as.Date(double_var, origin = "1970-01-01")) %>%
+      collect()
+  )
+
+  # actual and expected differ due to doubles are accounted for (floored in
+  # arrow and rounded to the next decimal in R)
+  expect_error(
+    compare_dplyr_binding(
+      .input %>%
+        mutate(date_double = as.Date(double_var, origin = "1970-01-01")) %>%
+        collect(),
+      test_df
+    )
+  )
+
+  expect_equal(
+    test_df %>%
+      arrow_table() %>%
+      mutate(date_double = as.Date(double_var, origin = "1970-01-01")) %>%
+      collect(),
+    test_df %>%
+      mutate(date_double = as.Date(double_var, origin = "1970-01-01")),
+    # the absolute value for date_double is different due to arrow casting from
+    # integer and r from double => testing with a tolerance of 0.6
+    # `actual$date_double`: 34.0
+    # `expected$date_double`: 34.6
+    tolerance = 0.6
+  )
+
+  expect_equal(
+    test_df %>%
+      record_batch() %>%
+      mutate(date_double = as.Date(double_var, origin = "1970-01-01")) %>%
+      collect(),
+    test_df %>%
+      mutate(date_double = as.Date(double_var, origin = "1970-01-01")),
+    # the absolute value for date_double is different due to arrow casting from
+    # integer and r from double => testing with a tolerance of 0.6
+    # `actual$date_double`: 34.0
+    # `expected$date_double`: 34.6
+    tolerance = 0.6
+  )
+
+  # currently we do not support an origin different to "1970-01-01"
+  expect_warning(
+    test_df %>%
+      arrow_table() %>%
+      mutate(date_int = as.Date(integer_var, origin = "1970-01-03")) %>%
+      collect(),
+    regexp = "`as.Date()` with an `origin` different than '1970-01-01' is not supported in Arrow",
+    fixed = TRUE
+

Review comment:
       ```suggestion
   ```




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] dragosmg commented on a change in pull request #12433: ARROW-14808 [R] Implement bindings for `lubridate::date()`

Posted by GitBox <gi...@apache.org>.
dragosmg commented on a change in pull request #12433:
URL: https://github.com/apache/arrow/pull/12433#discussion_r817576084



##########
File path: r/tests/testthat/test-dplyr-funcs-type.R
##########
@@ -768,3 +769,138 @@ test_that("nested structs can be created from scalars and existing data frames",
     tibble(a = 1:2)
   )
 })
+
+test_that("as.Date() converts successfully from date, timestamp, integer, char and double", {
+  test_df <- tibble::tibble(
+    posixct_var = as.POSIXct("2022-02-25 00:00:01", tz = "Europe/London"),
+    date_var = as.Date("2022-02-25"),
+    character_ymd_var = "2022-02-25 00:00:01",
+    character_ydm_var = "2022/25/02 00:00:01",
+    integer_var = 32L,
+    double_var = 34.56
+  )
+
+  # casting from POSIXct treated separately so we can skip on Windows
+  # TODO move the test for casting from POSIXct below once ARROW-13168 is done
+  compare_dplyr_binding(
+    .input %>%
+      mutate(
+        date_dv = as.Date(date_var),
+        date_char_ymd = as.Date(character_ymd_var, format = "%Y-%m-%d %H:%M:%S"),
+        date_char_ydm = as.Date(character_ydm_var, format = "%Y/%d/%m %H:%M:%S"),
+        date_int = as.Date(integer_var, origin = "1970-01-01")
+      ) %>%
+      collect(),
+    test_df
+  )
+
+  # the way we go about it is a bit different, but the result is the same =>
+  # testing without compare_dplyr_binding()
+  expect_equal(
+    test_df %>%
+      arrow_table() %>%
+      mutate(date_double = as.Date(double_var)) %>%
+      collect(),
+    test_df %>%
+      arrow_table() %>%
+      mutate(date_double = as.Date(double_var, origin = "1970-01-01")) %>%
+      collect()
+  )
+
+  expect_equal(
+    test_df %>%
+      record_batch() %>%
+      mutate(date_double = as.Date(double_var)) %>%
+      collect(),
+    test_df %>%
+      arrow_table() %>%
+      mutate(date_double = as.Date(double_var, origin = "1970-01-01")) %>%
+      collect()
+  )
+
+  # actual and expected differ due to doubles are accounted for (floored in
+  # arrow and rounded to the next decimal in R)
+  expect_error(
+    compare_dplyr_binding(
+      .input %>%
+        mutate(date_double = as.Date(double_var, origin = "1970-01-01")) %>%
+        collect(),
+      test_df
+    )
+  )
+
+  expect_equal(
+    test_df %>%
+      arrow_table() %>%
+      mutate(date_double = as.Date(double_var, origin = "1970-01-01")) %>%
+      collect(),
+    test_df %>%
+      mutate(date_double = as.Date(double_var, origin = "1970-01-01")),
+    # the absolute value for date_double is different due to arrow casting from
+    # integer and r from double => testing with a tolerance of 0.6
+    # `actual$date_double`: 34.0
+    # `expected$date_double`: 34.6
+    tolerance = 0.6
+  )
+
+  expect_equal(
+    test_df %>%
+      record_batch() %>%
+      mutate(date_double = as.Date(double_var, origin = "1970-01-01")) %>%
+      collect(),
+    test_df %>%
+      mutate(date_double = as.Date(double_var, origin = "1970-01-01")),
+    # the absolute value for date_double is different due to arrow casting from
+    # integer and r from double => testing with a tolerance of 0.6
+    # `actual$date_double`: 34.0
+    # `expected$date_double`: 34.6
+    tolerance = 0.6
+  )
+
+  # currently we do not support an origin different to "1970-01-01"
+  expect_warning(
+    test_df %>%
+      arrow_table() %>%
+      mutate(date_int = as.Date(integer_var, origin = "1970-01-03")) %>%
+      collect(),
+    regexp = "`as.Date()` with an `origin` different than '1970-01-01' is not supported in Arrow",
+    fixed = TRUE
+

Review comment:
       Done




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jonkeane commented on a change in pull request #12433: ARROW-14808 [R] Implement bindings for `lubridate::date()`

Posted by GitBox <gi...@apache.org>.
jonkeane commented on a change in pull request #12433:
URL: https://github.com/apache/arrow/pull/12433#discussion_r817906840



##########
File path: r/R/dplyr-funcs-type.R
##########
@@ -76,6 +76,60 @@ register_bindings_type_cast <- function() {
   register_binding("as.numeric", function(x) {
     Expression$create("cast", x, options = cast_options(to_type = float64()))
   })
+  register_binding("as.Date", function(x,
+                                       format = NULL,
+                                       tryFormats = "%Y-%m-%d",
+                                       origin = "1970-01-01",
+                                       tz = "UTC") {
+
+    if (call_binding("is.Date", x)) {
+      # base::as.Date() first converts to the desired timezone and then extracts
+      # the date, which is why we need to go through timestamp() first
+      return(x)
+
+    # cast from POSIXct
+    } else if (call_binding("is.POSIXct", x)) {
+      if (tz == "UTC") {
+        x <- build_expr("cast", x, options = cast_options(to_type = timestamp(timezone = tz)))
+      } else {
+        abort("`as.Date()` with a timezone different to 'UTC' is not supported in Arrow")
+      }
+
+    # cast from character
+    } else if (call_binding("is.character", x)) {
+      # this could be improved with tryFormats once strptime returns NA and we
+      # can use coalesce - https://issues.apache.org/jira/browse/ARROW-15659
+      # TODO revisit once https://issues.apache.org/jira/browse/ARROW-15659 is done
+      if (is.null(format)) {
+        if (length(tryFormats) == 1) {
+          format <- tryFormats[1]
+        } else {
+          abort("`as.Date()` with multiple `tryFormats` is not supported in Arrow yet")
+        }
+      }
+      x <- build_expr("strptime", x, options = list(format = format, unit = 0L))

Review comment:
       _nods_ I totally understand why the structure happened. In many cases it would be NBD to speculatively put in structure like this, but here I found it quite hard to reason about what was going on when and why because of the nesting. One example: I honestly hadn't totally put together that the check on `origin` only happened for numerics until you explicitly said that(!).
   
   Another option would be to break all of these into helper functions where we encapsulate all of the logic for each type and can go and inspect that as needed. But that's probably more structure than we need right now — and who knows if the C++ and other tickets will actually be resolves (or resolves in ways that would fit with this structure).
   
   This is admittedly a style point — I'm not going to block this from merging solely because of that, but I have found the nested if/elses rather difficult to review + keep all of the choices in my head while doing so.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jonkeane closed pull request #12433: ARROW-14808 [R] Implement bindings for `lubridate::date()`

Posted by GitBox <gi...@apache.org>.
jonkeane closed pull request #12433:
URL: https://github.com/apache/arrow/pull/12433


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] ursabot edited a comment on pull request #12433: ARROW-14808 [R] Implement bindings for `lubridate::date()`

Posted by GitBox <gi...@apache.org>.
ursabot edited a comment on pull request #12433:
URL: https://github.com/apache/arrow/pull/12433#issuecomment-1058320886


   Benchmark runs are scheduled for baseline = ce46c1adf10654248b02c349210bb63ecc6828ad and contender = 9719eae66dcf38c966ae769215d27020a6dd5550. 9719eae66dcf38c966ae769215d27020a6dd5550 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
   Conbench compare runs links:
   [Finished :arrow_down:0.0% :arrow_up:0.0%] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/f23d468014004a348cc275618eb8dbd7...0caa28678d194dc0b298bad2a2b53ca3/)
   [Finished :arrow_down:0.13% :arrow_up:0.0%] [test-mac-arm](https://conbench.ursa.dev/compare/runs/b4207e561b6447d984f6e26c65ee020d...f85c32d499a944df8a229a51f0aa8efb/)
   [Scheduled] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/2dd547de0ff44aeab70df443df281742...5a56a8e391904fa38b23a33f43ab61d6/)
   [Finished :arrow_down:0.26% :arrow_up:0.09%] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/1595f8e1d2c84a83928b04b87805d46c...8ea69c725c34424e867ffbbafe2533e4/)
   Supported benchmarks:
   ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python. Runs only benchmarks with cloud = True
   test-mac-arm: Supported benchmark langs: C++, Python, R
   ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
   ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] dragosmg commented on a change in pull request #12433: ARROW-14808 [R] Implement bindings for `lubridate::date()`

Posted by GitBox <gi...@apache.org>.
dragosmg commented on a change in pull request #12433:
URL: https://github.com/apache/arrow/pull/12433#discussion_r818512531



##########
File path: r/R/dplyr-funcs-type.R
##########
@@ -76,6 +76,60 @@ register_bindings_type_cast <- function() {
   register_binding("as.numeric", function(x) {
     Expression$create("cast", x, options = cast_options(to_type = float64()))
   })
+  register_binding("as.Date", function(x,
+                                       format = NULL,
+                                       tryFormats = "%Y-%m-%d",
+                                       origin = "1970-01-01",
+                                       tz = "UTC") {
+
+    if (call_binding("is.Date", x)) {
+      # base::as.Date() first converts to the desired timezone and then extracts
+      # the date, which is why we need to go through timestamp() first
+      return(x)
+
+    # cast from POSIXct
+    } else if (call_binding("is.POSIXct", x)) {
+      if (tz == "UTC") {
+        x <- build_expr("cast", x, options = cast_options(to_type = timestamp(timezone = tz)))
+      } else {
+        abort("`as.Date()` with a timezone different to 'UTC' is not supported in Arrow")
+      }

Review comment:
       You are right. My thinking on this was wrong. I thought we did not support other timezones. The `abort()` when `tz` is not `"UTC"` is not needed. I have updated the code and included a test to reflect this.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] dragosmg commented on a change in pull request #12433: ARROW-14808 [R] Implement bindings for `lubridate::date()`

Posted by GitBox <gi...@apache.org>.
dragosmg commented on a change in pull request #12433:
URL: https://github.com/apache/arrow/pull/12433#discussion_r810919455



##########
File path: r/tests/testthat/test-dplyr-funcs-datetime.R
##########
@@ -711,3 +711,23 @@ test_that("am/pm mirror lubridate", {
   )
 
 })
+
+test_that("date works in arrow", {
+  # https://issues.apache.org/jira/browse/ARROW-13168
+  skip_on_os("windows")
+  # this date is specific since lubridate::date() is different from base::as.Date()
+  # since as.Date returns the UTC date and date() doesn't
+  test_df <- tibble(
+    a = as.POSIXct(c("2012-03-26 23:12:13", NA), tz = "America/New_York"))
+
+  # we can't (for now) use namespacing, so we need to make sure lubridate::date()
+  # and not base::date() is being used
+  # TODO remove once https://issues.apache.org/jira/browse/ARROW-14575 is done
+  date <- lubridate::date

Review comment:
       As far as I can tell, that is correct (it is only necessary for testing). I never had this issue in an interactive session, after attaching lubridate. I don't think we will be 100% sure until we support namespacing with `pkg::function()` syntax. 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] dragosmg commented on a change in pull request #12433: ARROW-14808 [R] Implement bindings for `lubridate::date()`

Posted by GitBox <gi...@apache.org>.
dragosmg commented on a change in pull request #12433:
URL: https://github.com/apache/arrow/pull/12433#discussion_r815950557



##########
File path: r/R/dplyr-funcs-type.R
##########
@@ -76,6 +76,53 @@ register_bindings_type_cast <- function() {
   register_binding("as.numeric", function(x) {
     Expression$create("cast", x, options = cast_options(to_type = float64()))
   })
+  register_binding("as.Date", function(x,
+                                       format = NULL,
+                                       origin = "1970-01-01",
+                                       tz = "UTC") {
+    # base::as.Date() first converts to UTC and then extracts the date, which is
+    # why we need to go through timestamp() first - see unit tests for the real
+    # life impact of the difference between lubridate::date() and base::as.Date()
+    # browser()
+    if (call_binding("is.Date", x)) {
+      # arrow_date <- build_expr("cast", x, options = cast_options(to_type = date32()))
+      return(x)
+    } else if (call_binding("is.POSIXct", x)) {
+      if (tz == "UTC") {
+        arrow_timestamp <- build_expr("cast", x, options = cast_options(to_type = timestamp(timezone = tz)))
+        return(build_expr("cast", arrow_timestamp, options = cast_options(to_type = date32())))
+      } else {
+        abort("`as.Date()` with a timezone different to 'UTC' is not supported in Arrow")
+      }
+    } else if (call_binding("is.character", x)) {
+      # this could be improved with tryFormats once strptime returns NA and we
+      # can use coalesce - https://issues.apache.org/jira/browse/ARROW-15659
+      # TODO revisit once https://issues.apache.org/jira/browse/ARROW-15659 is done
+      if (!is.null(format)) {
+        arrow_timestamp <- call_binding("strptime", x, format, unit = "s")
+        return(build_expr("cast", arrow_timestamp, options = cast_options(to_type = date32())))
+      } else {

Review comment:
       Solved. I've added the `tryFormats` arg but with a single option - we're not trying all the supplied formats. 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] dragosmg commented on a change in pull request #12433: ARROW-14808 [R] Implement bindings for `lubridate::date()`

Posted by GitBox <gi...@apache.org>.
dragosmg commented on a change in pull request #12433:
URL: https://github.com/apache/arrow/pull/12433#discussion_r816657917



##########
File path: r/R/dplyr-funcs-type.R
##########
@@ -76,6 +76,64 @@ register_bindings_type_cast <- function() {
   register_binding("as.numeric", function(x) {
     Expression$create("cast", x, options = cast_options(to_type = float64()))
   })
+  register_binding("as.Date", function(x,
+                                       format = NULL,
+                                       tryFormats = "%Y-%m-%d",
+                                       origin = "1970-01-01",
+                                       tz = "UTC") {
+
+    if (call_binding("is.Date", x)) {
+      # base::as.Date() first converts to the desired timezone and then extracts
+      # the date, which is why we need to go through timestamp() first
+      return(x)
+
+    # cast from POSIXct
+    } else if (call_binding("is.POSIXct", x)) {
+      if (tz == "UTC") {
+        interim_x <- build_expr("cast", x, options = cast_options(to_type = timestamp(timezone = tz)))
+      } else {
+        abort("`as.Date()` with a timezone different to 'UTC' is not supported in Arrow")
+      }
+
+    # cast from character
+    } else if (call_binding("is.character", x)) {
+      # this could be improved with tryFormats once strptime returns NA and we
+      # can use coalesce - https://issues.apache.org/jira/browse/ARROW-15659
+      # TODO revisit once https://issues.apache.org/jira/browse/ARROW-15659 is done
+      if (is.null(format)) {
+        if (length(tryFormats) == 1) {
+          format <- tryFormats[1]
+        } else {
+          abort("`as.Date()` with multiple `tryFormats` is not supported in Arrow yet")
+        }
+      }
+      # if x is not an expression (e.g. passed as filter), convert it to one
+      if (!inherits(x, "Expression")) {
+        x <- build_expr("cast", x, options = cast_options(to_type = type(x)))
+      }

Review comment:
       > And separately — we should make a Jira if one doesn't already exist to see if we can allow `strptime(time, format = "%Y-%m-%d")` to work (which does in R)
   > 
   > ```
   > > strptime("2020-01-01 00:01:00", "%Y-%m-%d")
   > [1] "2020-01-01 CST"
   > ```
   
   Done. [ARROW-15813](https://issues.apache.org/jira/browse/ARROW-15813). Good catch - I clocked it, but it somehow skipped my mind. 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] dragosmg commented on a change in pull request #12433: ARROW-14808 [R] Implement bindings for `lubridate::date()`

Posted by GitBox <gi...@apache.org>.
dragosmg commented on a change in pull request #12433:
URL: https://github.com/apache/arrow/pull/12433#discussion_r813912419



##########
File path: r/tests/testthat/test-dplyr-funcs-datetime.R
##########
@@ -801,3 +801,85 @@ test_that("dst extracts daylight savings time correctly", {
     test_df
   )
 })
+
+test_that("date works in arrow", {
+  # https://issues.apache.org/jira/browse/ARROW-13168
+  skip_on_os("windows")
+  # this date is specific since lubridate::date() is different from base::as.Date()
+  # since as.Date returns the UTC date and date() doesn't
+  test_df <- tibble(
+    a = as.POSIXct(c("2012-03-26 23:12:13", NA), tz = "America/New_York"))
+
+  # we can't (for now) use namespacing, so we need to make sure lubridate::date()
+  # and not base::date() is being used
+  # TODO remove once https://issues.apache.org/jira/browse/ARROW-14575 is done
+  date <- lubridate::date
+  compare_dplyr_binding(
+    .input %>%
+      mutate(a_date = date(a)) %>%
+      collect(),
+    test_df
+  )
+
+  compare_dplyr_binding(
+    .input %>%
+      mutate(a_date_base = as.Date(a)) %>%
+      collect(),
+    test_df
+  )
+
+  r_date_object <- lubridate::ymd_hms("2012-03-26 23:12:13")
+  compare_dplyr_binding(
+    .input %>%
+      mutate(b = date(r_date_object)) %>%
+      collect(),
+    test_df
+  )
+
+  compare_dplyr_binding(
+    .input %>%
+      mutate(b_base = as.Date(r_date_object)) %>%
+      collect(),
+    test_df
+  )
+

Review comment:
       Do we want to remove all the tests below this point?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] dragosmg commented on a change in pull request #12433: ARROW-14808 [R] Implement bindings for `lubridate::date()`

Posted by GitBox <gi...@apache.org>.
dragosmg commented on a change in pull request #12433:
URL: https://github.com/apache/arrow/pull/12433#discussion_r817601442



##########
File path: r/tests/testthat/test-dplyr-funcs-type.R
##########
@@ -768,3 +769,138 @@ test_that("nested structs can be created from scalars and existing data frames",
     tibble(a = 1:2)
   )
 })
+
+test_that("as.Date() converts successfully from date, timestamp, integer, char and double", {
+  test_df <- tibble::tibble(
+    posixct_var = as.POSIXct("2022-02-25 00:00:01", tz = "Europe/London"),
+    date_var = as.Date("2022-02-25"),
+    character_ymd_var = "2022-02-25 00:00:01",
+    character_ydm_var = "2022/25/02 00:00:01",
+    integer_var = 32L,
+    double_var = 34.56
+  )
+
+  # casting from POSIXct treated separately so we can skip on Windows
+  # TODO move the test for casting from POSIXct below once ARROW-13168 is done
+  compare_dplyr_binding(
+    .input %>%
+      mutate(
+        date_dv = as.Date(date_var),
+        date_char_ymd = as.Date(character_ymd_var, format = "%Y-%m-%d %H:%M:%S"),
+        date_char_ydm = as.Date(character_ydm_var, format = "%Y/%d/%m %H:%M:%S"),
+        date_int = as.Date(integer_var, origin = "1970-01-01")
+      ) %>%
+      collect(),
+    test_df
+  )
+
+  # the way we go about it is a bit different, but the result is the same =>
+  # testing without compare_dplyr_binding()
+  expect_equal(
+    test_df %>%
+      arrow_table() %>%
+      mutate(date_double = as.Date(double_var)) %>%
+      collect(),
+    test_df %>%
+      arrow_table() %>%
+      mutate(date_double = as.Date(double_var, origin = "1970-01-01")) %>%
+      collect()
+  )

Review comment:
       I think they're remnants from a previous iteration. Removed. 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] dragosmg commented on a change in pull request #12433: ARROW-14808 [R] Implement bindings for `lubridate::date()`

Posted by GitBox <gi...@apache.org>.
dragosmg commented on a change in pull request #12433:
URL: https://github.com/apache/arrow/pull/12433#discussion_r817566088



##########
File path: r/R/dplyr-funcs-type.R
##########
@@ -76,6 +76,60 @@ register_bindings_type_cast <- function() {
   register_binding("as.numeric", function(x) {
     Expression$create("cast", x, options = cast_options(to_type = float64()))
   })
+  register_binding("as.Date", function(x,
+                                       format = NULL,
+                                       tryFormats = "%Y-%m-%d",
+                                       origin = "1970-01-01",
+                                       tz = "UTC") {
+
+    if (call_binding("is.Date", x)) {
+      # base::as.Date() first converts to the desired timezone and then extracts
+      # the date, which is why we need to go through timestamp() first
+      return(x)
+
+    # cast from POSIXct
+    } else if (call_binding("is.POSIXct", x)) {
+      if (tz == "UTC") {
+        x <- build_expr("cast", x, options = cast_options(to_type = timestamp(timezone = tz)))
+      } else {
+        abort("`as.Date()` with a timezone different to 'UTC' is not supported in Arrow")
+      }
+
+    # cast from character
+    } else if (call_binding("is.character", x)) {
+      # this could be improved with tryFormats once strptime returns NA and we
+      # can use coalesce - https://issues.apache.org/jira/browse/ARROW-15659
+      # TODO revisit once https://issues.apache.org/jira/browse/ARROW-15659 is done
+      if (is.null(format)) {
+        if (length(tryFormats) == 1) {
+          format <- tryFormats[1]
+        } else {
+          abort("`as.Date()` with multiple `tryFormats` is not supported in Arrow yet")
+        }
+      }
+      x <- build_expr("strptime", x, options = list(format = format, unit = 0L))

Review comment:
       Done. Again, the reason I wrote it like I did was to leave us (_imho_) in the best position for the follow-up work. 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jonkeane commented on a change in pull request #12433: ARROW-14808 [R] Implement bindings for `lubridate::date()`

Posted by GitBox <gi...@apache.org>.
jonkeane commented on a change in pull request #12433:
URL: https://github.com/apache/arrow/pull/12433#discussion_r818711237



##########
File path: r/tests/testthat/test-dplyr-funcs-type.R
##########
@@ -768,3 +769,138 @@ test_that("nested structs can be created from scalars and existing data frames",
     tibble(a = 1:2)
   )
 })
+
+test_that("as.Date() converts successfully from date, timestamp, integer, char and double", {
+  test_df <- tibble::tibble(
+    posixct_var = as.POSIXct("2022-02-25 00:00:01", tz = "Europe/London"),
+    date_var = as.Date("2022-02-25"),
+    character_ymd_var = "2022-02-25 00:00:01",
+    character_ydm_var = "2022/25/02 00:00:01",
+    integer_var = 32L,
+    double_var = 34.56
+  )
+
+  # casting from POSIXct treated separately so we can skip on Windows
+  # TODO move the test for casting from POSIXct below once ARROW-13168 is done
+  compare_dplyr_binding(
+    .input %>%
+      mutate(
+        date_dv = as.Date(date_var),
+        date_char_ymd = as.Date(character_ymd_var, format = "%Y-%m-%d %H:%M:%S"),
+        date_char_ydm = as.Date(character_ydm_var, format = "%Y/%d/%m %H:%M:%S"),
+        date_int = as.Date(integer_var, origin = "1970-01-01")
+      ) %>%
+      collect(),
+    test_df
+  )
+
+  # the way we go about it is a bit different, but the result is the same =>
+  # testing without compare_dplyr_binding()
+  expect_equal(
+    test_df %>%
+      arrow_table() %>%
+      mutate(date_double = as.Date(double_var)) %>%
+      collect(),
+    test_df %>%
+      arrow_table() %>%
+      mutate(date_double = as.Date(double_var, origin = "1970-01-01")) %>%
+      collect()
+  )
+
+  expect_equal(
+    test_df %>%
+      record_batch() %>%
+      mutate(date_double = as.Date(double_var)) %>%
+      collect(),
+    test_df %>%
+      arrow_table() %>%
+      mutate(date_double = as.Date(double_var, origin = "1970-01-01")) %>%
+      collect()
+  )
+
+  # actual and expected differ due to doubles are accounted for (floored in
+  # arrow and rounded to the next decimal in R)
+  expect_error(
+    compare_dplyr_binding(
+      .input %>%
+        mutate(date_double = as.Date(double_var, origin = "1970-01-01")) %>%
+        collect(),
+      test_df
+    )
+  )

Review comment:
       Thanks for all this exploration on `date64()` that's super super helpful, and exactly the right approach for these kinds of problems. And thank you for bringing this up in https://issues.apache.org/jira/browse/ARROW-15798 which is also a good place for it. I'll comment this there as well, but this does make me wonder if our conversion of date64() from arrow -> POSIXct in R might not be quite right! But we definitely don't need to tackle that in this PR!




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jonkeane commented on a change in pull request #12433: ARROW-14808 [R] Implement bindings for `lubridate::date()`

Posted by GitBox <gi...@apache.org>.
jonkeane commented on a change in pull request #12433:
URL: https://github.com/apache/arrow/pull/12433#discussion_r818707137



##########
File path: r/R/dplyr-funcs-type.R
##########
@@ -76,6 +76,60 @@ register_bindings_type_cast <- function() {
   register_binding("as.numeric", function(x) {
     Expression$create("cast", x, options = cast_options(to_type = float64()))
   })
+  register_binding("as.Date", function(x,
+                                       format = NULL,
+                                       tryFormats = "%Y-%m-%d",
+                                       origin = "1970-01-01",
+                                       tz = "UTC") {
+
+    if (call_binding("is.Date", x)) {
+      # base::as.Date() first converts to the desired timezone and then extracts
+      # the date, which is why we need to go through timestamp() first
+      return(x)
+
+    # cast from POSIXct
+    } else if (call_binding("is.POSIXct", x)) {
+      if (tz == "UTC") {
+        x <- build_expr("cast", x, options = cast_options(to_type = timestamp(timezone = tz)))
+      } else {
+        abort("`as.Date()` with a timezone different to 'UTC' is not supported in Arrow")
+      }
+
+    # cast from character
+    } else if (call_binding("is.character", x)) {
+      # this could be improved with tryFormats once strptime returns NA and we
+      # can use coalesce - https://issues.apache.org/jira/browse/ARROW-15659
+      # TODO revisit once https://issues.apache.org/jira/browse/ARROW-15659 is done
+      if (is.null(format)) {
+        if (length(tryFormats) == 1) {
+          format <- tryFormats[1]
+        } else {
+          abort("`as.Date()` with multiple `tryFormats` is not supported in Arrow yet")
+        }
+      }
+      x <- build_expr("strptime", x, options = list(format = format, unit = 0L))
+
+    # cast from numeric
+    } else if (call_binding("is.numeric", x)) {
+      # the origin argument will be better supported once we implement temporal
+      # arithmetic (https://issues.apache.org/jira/browse/ARROW-14947)
+      # TODO revisit once the above has been sorted
+      if (!call_binding("is.integer", x)) {
+        # Arrow does not support direct casting from double to date so we have
+        # to convert to integers first - casting to int32() would error so we
+        # need to use `floor` before casting. `floor` is also a bit safer than
+        # int32() with `safe = FALSE` since it doesn't switch to `ceiling` for
+        # negative numbers
+        # TODO revisit if arrow decides to support double -> date casting
+        x <- call_binding("floor", x)
+        x <- build_expr("cast", x, options = cast_options(to_type = int32()))
+      }
+      if (origin != "1970-01-01") {
+        abort("`as.Date()` with an `origin` different than '1970-01-01' is not supported in Arrow")
+      }
+    }

Review comment:
       _nods_ yeah, so maybe we should keep it only when we're dealing with numerics. I'm doing another pass through the code this morning, so I'll comment on it in place. Thanks for this follow up!




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] dragosmg commented on a change in pull request #12433: ARROW-14808 [R] Implement bindings for `lubridate::date()`

Posted by GitBox <gi...@apache.org>.
dragosmg commented on a change in pull request #12433:
URL: https://github.com/apache/arrow/pull/12433#discussion_r817552565



##########
File path: r/tests/testthat/test-dplyr-funcs-datetime.R
##########
@@ -801,3 +801,99 @@ test_that("dst extracts daylight savings time correctly", {
     test_df
   )
 })
+
+test_that("date works in arrow", {
+  # https://issues.apache.org/jira/browse/ARROW-13168
+  skip_on_os("windows")
+  # this date is specific since lubridate::date() is different from base::as.Date()
+  # since as.Date returns the UTC date and date() doesn't
+  test_df <- tibble(
+    posixct_date = as.POSIXct(c("2012-03-26 23:12:13", NA), tz = "America/New_York"),
+    integer_var = c(32L, NA))
+
+  r_date_object <- lubridate::ymd_hms("2012-03-26 23:12:13")
+
+  # we can't (for now) use namespacing, so we need to make sure lubridate::date()
+  # and not base::date() is being used. This is due to the way testthat runs and
+  # normal use of arrow would not have to do this explicitly.
+  # TODO remove once https://issues.apache.org/jira/browse/ARROW-14575 is done
+  date <- lubridate::date
+
+  compare_dplyr_binding(
+    .input %>%
+      mutate(a_date = date(posixct_date)) %>%
+      collect(),
+    test_df
+  )
+
+  compare_dplyr_binding(
+    .input %>%
+      mutate(a_date_base = as.Date(posixct_date)) %>%
+      collect(),
+    test_df
+  )
+
+  compare_dplyr_binding(
+    .input %>%
+      mutate(date_from_r_object = date(r_date_object)) %>%
+      collect(),
+    test_df
+  )
+
+  compare_dplyr_binding(
+    .input %>%
+      mutate(as_date_from_r_object = as.Date(r_date_object)) %>%
+      collect(),
+    test_df
+  )
+
+  # date from integer supported in arrow (similar to base::as.Date()), but in
+  # Arrow it assumes a fixed origin "1970-01-01". However this is not supported
+  # by lubridate. lubridate::date(integer_var) errors without an `origin`
+  expect_equal(
+    test_df %>%
+      arrow_table() %>%
+      select(integer_var) %>%
+      mutate(date_int = date(integer_var)) %>%
+      collect(),
+    tibble(integer_var = c(32L, NA),
+           date_int = as.Date(c("1970-02-02", NA)))
+  )
+})
+
+test_that("date() errors with unsupported inputs", {
+  test_df <- tibble::tibble(
+    posixct_var = as.POSIXct("2022-02-25 00:00:01", tz = "Europe/London"),
+    date_var = as.Date("2022-02-25"),
+    character_ymd_var = "2022-02-25 00:00:01",
+    character_ydm_var = "2022/25/02 00:00:01",
+    integer_var = 32L,
+    double_var = 34.56,
+    boolean_var = TRUE
+  )
+
+  expect_error(
+    test_df %>%
+      arrow_table() %>%
+      mutate(date_char = date(character_ymd_var)) %>%
+      collect(),
+    regexp = "Unsupported cast from string to date32 using function cast_date32"
+  )
+
+  expect_error(
+    test_df %>%
+      arrow_table() %>%
+      mutate(date_bool = date(boolean_var)) %>%
+      collect(),
+    regexp = "Unsupported cast from bool to date32 using function cast_date32"
+  )
+
+  expect_error(
+    test_df %>%
+      arrow_table() %>%
+      mutate(date_double = date(double_var)) %>%
+      collect(),
+    regexp = "Unsupported cast from double to date32 using function cast_date32"
+  )
+

Review comment:
       Done




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] dragosmg commented on a change in pull request #12433: ARROW-14808 [R] Implement bindings for `lubridate::date()`

Posted by GitBox <gi...@apache.org>.
dragosmg commented on a change in pull request #12433:
URL: https://github.com/apache/arrow/pull/12433#discussion_r817587976



##########
File path: r/tests/testthat/test-dplyr-funcs-datetime.R
##########
@@ -801,3 +801,99 @@ test_that("dst extracts daylight savings time correctly", {
     test_df
   )
 })
+
+test_that("date works in arrow", {
+  # https://issues.apache.org/jira/browse/ARROW-13168
+  skip_on_os("windows")
+  # this date is specific since lubridate::date() is different from base::as.Date()
+  # since as.Date returns the UTC date and date() doesn't
+  test_df <- tibble(
+    posixct_date = as.POSIXct(c("2012-03-26 23:12:13", NA), tz = "America/New_York"),
+    integer_var = c(32L, NA))
+
+  r_date_object <- lubridate::ymd_hms("2012-03-26 23:12:13")
+
+  # we can't (for now) use namespacing, so we need to make sure lubridate::date()
+  # and not base::date() is being used. This is due to the way testthat runs and
+  # normal use of arrow would not have to do this explicitly.
+  # TODO remove once https://issues.apache.org/jira/browse/ARROW-14575 is done
+  date <- lubridate::date
+
+  compare_dplyr_binding(
+    .input %>%
+      mutate(a_date = date(posixct_date)) %>%
+      collect(),
+    test_df
+  )
+
+  compare_dplyr_binding(
+    .input %>%
+      mutate(a_date_base = as.Date(posixct_date)) %>%
+      collect(),
+    test_df
+  )
+
+  compare_dplyr_binding(
+    .input %>%
+      mutate(date_from_r_object = date(r_date_object)) %>%
+      collect(),
+    test_df
+  )
+
+  compare_dplyr_binding(
+    .input %>%
+      mutate(as_date_from_r_object = as.Date(r_date_object)) %>%
+      collect(),
+    test_df
+  )
+
+  # date from integer supported in arrow (similar to base::as.Date()), but in
+  # Arrow it assumes a fixed origin "1970-01-01". However this is not supported
+  # by lubridate. lubridate::date(integer_var) errors without an `origin`
+  expect_equal(
+    test_df %>%
+      arrow_table() %>%
+      select(integer_var) %>%
+      mutate(date_int = date(integer_var)) %>%
+      collect(),
+    tibble(integer_var = c(32L, NA),
+           date_int = as.Date(c("1970-02-02", NA)))
+  )
+})
+
+test_that("date() errors with unsupported inputs", {
+  test_df <- tibble::tibble(
+    posixct_var = as.POSIXct("2022-02-25 00:00:01", tz = "Europe/London"),
+    date_var = as.Date("2022-02-25"),
+    character_ymd_var = "2022-02-25 00:00:01",
+    character_ydm_var = "2022/25/02 00:00:01",
+    integer_var = 32L,
+    double_var = 34.56,
+    boolean_var = TRUE
+  )
+
+  expect_error(
+    test_df %>%
+      arrow_table() %>%
+      mutate(date_char = date(character_ymd_var)) %>%
+      collect(),
+    regexp = "Unsupported cast from string to date32 using function cast_date32"
+  )
+
+  expect_error(
+    test_df %>%
+      arrow_table() %>%
+      mutate(date_bool = date(boolean_var)) %>%
+      collect(),
+    regexp = "Unsupported cast from bool to date32 using function cast_date32"
+  )
+
+  expect_error(
+    test_df %>%
+      arrow_table() %>%
+      mutate(date_double = date(double_var)) %>%
+      collect(),
+    regexp = "Unsupported cast from double to date32 using function cast_date32"
+  )

Review comment:
       Done.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] dragosmg commented on a change in pull request #12433: ARROW-14808 [R] Implement bindings for `lubridate::date()`

Posted by GitBox <gi...@apache.org>.
dragosmg commented on a change in pull request #12433:
URL: https://github.com/apache/arrow/pull/12433#discussion_r810919455



##########
File path: r/tests/testthat/test-dplyr-funcs-datetime.R
##########
@@ -711,3 +711,23 @@ test_that("am/pm mirror lubridate", {
   )
 
 })
+
+test_that("date works in arrow", {
+  # https://issues.apache.org/jira/browse/ARROW-13168
+  skip_on_os("windows")
+  # this date is specific since lubridate::date() is different from base::as.Date()
+  # since as.Date returns the UTC date and date() doesn't
+  test_df <- tibble(
+    a = as.POSIXct(c("2012-03-26 23:12:13", NA), tz = "America/New_York"))
+
+  # we can't (for now) use namespacing, so we need to make sure lubridate::date()
+  # and not base::date() is being used
+  # TODO remove once https://issues.apache.org/jira/browse/ARROW-14575 is done
+  date <- lubridate::date

Review comment:
       As far as I can tell, that is correct. I never had this issue in an interactive session, after attaching lubridate. I don't think we will be 100% sure until we support namespacing with `pkg::function()` syntax. 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] thisisnic commented on a change in pull request #12433: ARROW-14808 [R] Implement bindings for `lubridate::date()`

Posted by GitBox <gi...@apache.org>.
thisisnic commented on a change in pull request #12433:
URL: https://github.com/apache/arrow/pull/12433#discussion_r811830139



##########
File path: r/R/dplyr-funcs-datetime.R
##########
@@ -147,5 +147,7 @@ register_bindings_datetime <- function() {
   register_binding("pm", function(x) {
     !call_binding("am", x)
   })
-
+  register_binding("date", function(x) {
+    x$cast(date32())
+  })

Review comment:
       > Not sure what we want to achieve by calling a binding (in this case "date") on an R object
   
   I may be misinterpreting this, but I think maybe what Jon is getting at is whether you could do something like this in an arrow-dplyr chain:
   
   ```
   library(dplyr)
   library(lubridate)
   
   arbitrary_date_object <- ymd_hms("2012-03-26 23:12:13") + days(1:32)
   
   mtcars %>%
     mutate(x = date(arbitrary_date_object))
   
   ```




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jonkeane commented on a change in pull request #12433: ARROW-14808 [R] Implement bindings for `lubridate::date()`

Posted by GitBox <gi...@apache.org>.
jonkeane commented on a change in pull request #12433:
URL: https://github.com/apache/arrow/pull/12433#discussion_r814899081



##########
File path: r/R/dplyr-funcs-type.R
##########
@@ -76,6 +76,53 @@ register_bindings_type_cast <- function() {
   register_binding("as.numeric", function(x) {
     Expression$create("cast", x, options = cast_options(to_type = float64()))
   })
+  register_binding("as.Date", function(x,
+                                       format = NULL,
+                                       origin = "1970-01-01",
+                                       tz = "UTC") {
+    # base::as.Date() first converts to UTC and then extracts the date, which is
+    # why we need to go through timestamp() first - see unit tests for the real
+    # life impact of the difference between lubridate::date() and base::as.Date()
+    # browser()
+    if (call_binding("is.Date", x)) {
+      # arrow_date <- build_expr("cast", x, options = cast_options(to_type = date32()))
+      return(x)
+    } else if (call_binding("is.POSIXct", x)) {
+      if (tz == "UTC") {
+        arrow_timestamp <- build_expr("cast", x, options = cast_options(to_type = timestamp(timezone = tz)))
+        return(build_expr("cast", arrow_timestamp, options = cast_options(to_type = date32())))
+      } else {
+        abort("`as.Date()` with a timezone different to 'UTC' is not supported in Arrow")
+      }
+    } else if (call_binding("is.character", x)) {
+      # this could be improved with tryFormats once strptime returns NA and we
+      # can use coalesce - https://issues.apache.org/jira/browse/ARROW-15659
+      # TODO revisit once https://issues.apache.org/jira/browse/ARROW-15659 is done
+      if (!is.null(format)) {
+        arrow_timestamp <- call_binding("strptime", x, format, unit = "s")
+        return(build_expr("cast", arrow_timestamp, options = cast_options(to_type = date32())))
+      } else {

Review comment:
       I suspect we actually will need to do something about these as it's causing failures elsewhere: https://github.com/apache/arrow/runs/5335594462?check_suite_focus=true#step:7:25913
   
   We could probably get away with defaulting the format to `%Y-%m-%d` and note that we don't (yet) support `%Y/%m/%d` as a default.
   
   Another option might be to identify if we're working with an R object or an Arrow object, and specifically call `base::as.Date()` if it's an R object. (this approach might actually be even broader and useful elsewhere for cases like this). For single-values this should be fine because converting from an R object of length 1 and then doing an arrow computation is not much faster than doing a computation on an R object of length 1 and then doing the R->Arrow conversion. 

##########
File path: r/R/dplyr-funcs-type.R
##########
@@ -76,6 +76,53 @@ register_bindings_type_cast <- function() {
   register_binding("as.numeric", function(x) {
     Expression$create("cast", x, options = cast_options(to_type = float64()))
   })
+  register_binding("as.Date", function(x,
+                                       format = NULL,
+                                       origin = "1970-01-01",
+                                       tz = "UTC") {
+    # base::as.Date() first converts to UTC and then extracts the date, which is
+    # why we need to go through timestamp() first - see unit tests for the real
+    # life impact of the difference between lubridate::date() and base::as.Date()
+    # browser()

Review comment:
       I know you'll take care of these, but wanted to flag them since they don't pop up too noisily 

##########
File path: r/R/dplyr-funcs-type.R
##########
@@ -76,6 +76,53 @@ register_bindings_type_cast <- function() {
   register_binding("as.numeric", function(x) {
     Expression$create("cast", x, options = cast_options(to_type = float64()))
   })
+  register_binding("as.Date", function(x,
+                                       format = NULL,
+                                       origin = "1970-01-01",
+                                       tz = "UTC") {
+    # base::as.Date() first converts to UTC and then extracts the date, which is
+    # why we need to go through timestamp() first - see unit tests for the real
+    # life impact of the difference between lubridate::date() and base::as.Date()
+    # browser()
+    if (call_binding("is.Date", x)) {
+      # arrow_date <- build_expr("cast", x, options = cast_options(to_type = date32()))
+      return(x)
+    } else if (call_binding("is.POSIXct", x)) {
+      if (tz == "UTC") {
+        arrow_timestamp <- build_expr("cast", x, options = cast_options(to_type = timestamp(timezone = tz)))
+        return(build_expr("cast", arrow_timestamp, options = cast_options(to_type = date32())))
+      } else {
+        abort("`as.Date()` with a timezone different to 'UTC' is not supported in Arrow")
+      }
+    } else if (call_binding("is.character", x)) {
+      # this could be improved with tryFormats once strptime returns NA and we
+      # can use coalesce - https://issues.apache.org/jira/browse/ARROW-15659
+      # TODO revisit once https://issues.apache.org/jira/browse/ARROW-15659 is done
+      if (!is.null(format)) {
+        arrow_timestamp <- call_binding("strptime", x, format, unit = "s")
+        return(build_expr("cast", arrow_timestamp, options = cast_options(to_type = date32())))
+      } else {
+        abort("`as.Date()` without `format` is not supported in Arrow")
+      }
+
+    } else if (call_binding("is.numeric", x)) {
+      # the origin argument will be better supported once we implement temporal
+      # arithmetic (https://issues.apache.org/jira/browse/ARROW-14947)
+      # TODO revisit once the above has been sorted
+      if (!call_binding("is.integer", x)) {
+        # Arrow does not support direct casting from double to date so we have
+        # to convert to integers first - casting to int32() would error so we
+        # need to use round before casting
+        x <- call_binding("floor", x)
+        x <- build_expr("cast", x, options = (cast_options(to_type = int32())))
+      }
+      if (origin == "1970-01-01") {
+        return(build_expr("cast", x, options = cast_options(to_type = date32())))
+      } else {
+        abort("`as.Date()` with an `origin` different than '1970-01-01' is not supported in Arrow")

Review comment:
       This approach of erroring if the origin is not the epoch is good. It diverges slightly from what R does, but that's ok.
   
   As a possible extension, when we have addition, we could (probably?) support different origins by getting the delta between the epoch and provided origin and then doing the arithmetic. Or we could see if there's plans to add an origin argument at the C++ level. Either way, neither needs to be solved for this!




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] dragosmg commented on a change in pull request #12433: ARROW-14808 [R] Implement bindings for `lubridate::date()`

Posted by GitBox <gi...@apache.org>.
dragosmg commented on a change in pull request #12433:
URL: https://github.com/apache/arrow/pull/12433#discussion_r815189581



##########
File path: r/R/dplyr-funcs-type.R
##########
@@ -76,6 +76,53 @@ register_bindings_type_cast <- function() {
   register_binding("as.numeric", function(x) {
     Expression$create("cast", x, options = cast_options(to_type = float64()))
   })
+  register_binding("as.Date", function(x,
+                                       format = NULL,
+                                       origin = "1970-01-01",
+                                       tz = "UTC") {
+    # base::as.Date() first converts to UTC and then extracts the date, which is
+    # why we need to go through timestamp() first - see unit tests for the real
+    # life impact of the difference between lubridate::date() and base::as.Date()
+    # browser()
+    if (call_binding("is.Date", x)) {
+      # arrow_date <- build_expr("cast", x, options = cast_options(to_type = date32()))
+      return(x)
+    } else if (call_binding("is.POSIXct", x)) {
+      if (tz == "UTC") {
+        arrow_timestamp <- build_expr("cast", x, options = cast_options(to_type = timestamp(timezone = tz)))
+        return(build_expr("cast", arrow_timestamp, options = cast_options(to_type = date32())))
+      } else {
+        abort("`as.Date()` with a timezone different to 'UTC' is not supported in Arrow")
+      }
+    } else if (call_binding("is.character", x)) {
+      # this could be improved with tryFormats once strptime returns NA and we
+      # can use coalesce - https://issues.apache.org/jira/browse/ARROW-15659
+      # TODO revisit once https://issues.apache.org/jira/browse/ARROW-15659 is done
+      if (!is.null(format)) {
+        arrow_timestamp <- call_binding("strptime", x, format, unit = "s")
+        return(build_expr("cast", arrow_timestamp, options = cast_options(to_type = date32())))
+      } else {
+        abort("`as.Date()` without `format` is not supported in Arrow")
+      }
+
+    } else if (call_binding("is.numeric", x)) {
+      # the origin argument will be better supported once we implement temporal
+      # arithmetic (https://issues.apache.org/jira/browse/ARROW-14947)
+      # TODO revisit once the above has been sorted
+      if (!call_binding("is.integer", x)) {
+        # Arrow does not support direct casting from double to date so we have
+        # to convert to integers first - casting to int32() would error so we
+        # need to use round before casting
+        x <- call_binding("floor", x)
+        x <- build_expr("cast", x, options = (cast_options(to_type = int32())))
+      }
+      if (origin == "1970-01-01") {
+        return(build_expr("cast", x, options = cast_options(to_type = date32())))
+      } else {
+        abort("`as.Date()` with an `origin` different than '1970-01-01' is not supported in Arrow")

Review comment:
       Yep, I do have at least a couple of follow-up tickets in mind. 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] dragosmg commented on a change in pull request #12433: ARROW-14808 [R] Implement bindings for `lubridate::date()`

Posted by GitBox <gi...@apache.org>.
dragosmg commented on a change in pull request #12433:
URL: https://github.com/apache/arrow/pull/12433#discussion_r816625715



##########
File path: r/R/dplyr-funcs-type.R
##########
@@ -76,6 +76,64 @@ register_bindings_type_cast <- function() {
   register_binding("as.numeric", function(x) {
     Expression$create("cast", x, options = cast_options(to_type = float64()))
   })
+  register_binding("as.Date", function(x,
+                                       format = NULL,
+                                       tryFormats = "%Y-%m-%d",
+                                       origin = "1970-01-01",
+                                       tz = "UTC") {
+
+    if (call_binding("is.Date", x)) {
+      # base::as.Date() first converts to the desired timezone and then extracts
+      # the date, which is why we need to go through timestamp() first
+      return(x)
+
+    # cast from POSIXct
+    } else if (call_binding("is.POSIXct", x)) {
+      if (tz == "UTC") {
+        interim_x <- build_expr("cast", x, options = cast_options(to_type = timestamp(timezone = tz)))
+      } else {
+        abort("`as.Date()` with a timezone different to 'UTC' is not supported in Arrow")
+      }
+
+    # cast from character
+    } else if (call_binding("is.character", x)) {
+      # this could be improved with tryFormats once strptime returns NA and we
+      # can use coalesce - https://issues.apache.org/jira/browse/ARROW-15659
+      # TODO revisit once https://issues.apache.org/jira/browse/ARROW-15659 is done
+      if (is.null(format)) {
+        if (length(tryFormats) == 1) {
+          format <- tryFormats[1]
+        } else {
+          abort("`as.Date()` with multiple `tryFormats` is not supported in Arrow yet")
+        }
+      }
+      # if x is not an expression (e.g. passed as filter), convert it to one
+      if (!inherits(x, "Expression")) {
+        x <- build_expr("cast", x, options = cast_options(to_type = type(x)))
+      }

Review comment:
       I introduced that step (converting a character `x` to `Expression`) as a way of dealing with the errors we were seeing with `filter()`. In `filter.arrow_dplyr_query()` we get the following error after tidy-eval-ing the filter expression:
   ```r
   > filters <- lapply(filts, arrow_eval, arrow_mask(.data))
   > filters
   [[1]]
   [1] "Expression arguments must be Expression objects"
   attr(,"class")
   [1] "try-error"
   attr(,"condition")
   <assertError: Expression arguments must be Expression objects>
   ``` 
   
   where `filts` is:
   ```r
   > filts
   <list_of<quosure>>
   
   [[1]]
   <quosure>
   expr: ^ts >= as.Date("2015-05-04")
   env:  0x297eaa768
   ``` 
   
   and the original test is:
   ```r
   # ds is an open dataset
   ds %>%
         filter(ts >= as.Date("2015-05-04")) %>%
         filter(part == 1) %>%
         select(ts) %>%
         collect()
   ```
   
   As far as I understand it, it fails due to `"2015-05-04"` being a character and not an `Expression` object. Inside the `filter. ...` method:
   ```r
   > b <- arrow_eval(filts[[1]], arrow_mask(.data))
   > b
   [1] "Expression arguments must be Expression objects"
   attr(,"class")
   [1] "try-error"
   attr(,"condition")
   <assertError: Expression arguments must be Expression objects>
   ```
   Hence my approach: if we somehow get to that point in the function body and, for some reason, character `x` is not yet an `Expression, make it into one. 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] dragosmg commented on a change in pull request #12433: ARROW-14808 [R] Implement bindings for `lubridate::date()`

Posted by GitBox <gi...@apache.org>.
dragosmg commented on a change in pull request #12433:
URL: https://github.com/apache/arrow/pull/12433#discussion_r816834237



##########
File path: r/R/dplyr-funcs-type.R
##########
@@ -76,6 +76,64 @@ register_bindings_type_cast <- function() {
   register_binding("as.numeric", function(x) {
     Expression$create("cast", x, options = cast_options(to_type = float64()))
   })
+  register_binding("as.Date", function(x,
+                                       format = NULL,
+                                       tryFormats = "%Y-%m-%d",
+                                       origin = "1970-01-01",
+                                       tz = "UTC") {
+
+    if (call_binding("is.Date", x)) {
+      # base::as.Date() first converts to the desired timezone and then extracts
+      # the date, which is why we need to go through timestamp() first
+      return(x)
+
+    # cast from POSIXct
+    } else if (call_binding("is.POSIXct", x)) {
+      if (tz == "UTC") {
+        interim_x <- build_expr("cast", x, options = cast_options(to_type = timestamp(timezone = tz)))
+      } else {
+        abort("`as.Date()` with a timezone different to 'UTC' is not supported in Arrow")
+      }
+
+    # cast from character
+    } else if (call_binding("is.character", x)) {
+      # this could be improved with tryFormats once strptime returns NA and we
+      # can use coalesce - https://issues.apache.org/jira/browse/ARROW-15659
+      # TODO revisit once https://issues.apache.org/jira/browse/ARROW-15659 is done
+      if (is.null(format)) {
+        if (length(tryFormats) == 1) {
+          format <- tryFormats[1]
+        } else {
+          abort("`as.Date()` with multiple `tryFormats` is not supported in Arrow yet")
+        }
+      }
+      # if x is not an expression (e.g. passed as filter), convert it to one
+      if (!inherits(x, "Expression")) {
+        x <- build_expr("cast", x, options = cast_options(to_type = type(x)))
+      }
+      interim_x <- call_binding("strptime", x, format, unit = "s")
+
+    # cast from numeric
+    } else if (call_binding("is.numeric", x)) {
+      # the origin argument will be better supported once we implement temporal
+      # arithmetic (https://issues.apache.org/jira/browse/ARROW-14947)
+      # TODO revisit once the above has been sorted
+      if (!call_binding("is.integer", x)) {
+        # Arrow does not support direct casting from double to date so we have
+        # to convert to integers first - casting to int32() would error so we
+        # need to use round before casting

Review comment:
       Thanks. That was my reasoning for going with `floor` (instead of `round`, before being aware of `safe = FALSE`). Would you mind commenting at [ARROW-15798](https://issues.apache.org/jira/browse/ARROW-15798) in support of `float -> date`?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] dragosmg commented on a change in pull request #12433: ARROW-14808 [R] Implement bindings for `lubridate::date()`

Posted by GitBox <gi...@apache.org>.
dragosmg commented on a change in pull request #12433:
URL: https://github.com/apache/arrow/pull/12433#discussion_r815947864



##########
File path: r/R/dplyr-funcs-type.R
##########
@@ -76,6 +76,53 @@ register_bindings_type_cast <- function() {
   register_binding("as.numeric", function(x) {
     Expression$create("cast", x, options = cast_options(to_type = float64()))
   })
+  register_binding("as.Date", function(x,
+                                       format = NULL,
+                                       origin = "1970-01-01",
+                                       tz = "UTC") {
+    # base::as.Date() first converts to UTC and then extracts the date, which is
+    # why we need to go through timestamp() first - see unit tests for the real
+    # life impact of the difference between lubridate::date() and base::as.Date()
+    # browser()
+    if (call_binding("is.Date", x)) {
+      # arrow_date <- build_expr("cast", x, options = cast_options(to_type = date32()))
+      return(x)
+    } else if (call_binding("is.POSIXct", x)) {
+      if (tz == "UTC") {
+        arrow_timestamp <- build_expr("cast", x, options = cast_options(to_type = timestamp(timezone = tz)))
+        return(build_expr("cast", arrow_timestamp, options = cast_options(to_type = date32())))
+      } else {
+        abort("`as.Date()` with a timezone different to 'UTC' is not supported in Arrow")
+      }
+    } else if (call_binding("is.character", x)) {
+      # this could be improved with tryFormats once strptime returns NA and we
+      # can use coalesce - https://issues.apache.org/jira/browse/ARROW-15659
+      # TODO revisit once https://issues.apache.org/jira/browse/ARROW-15659 is done
+      if (!is.null(format)) {
+        arrow_timestamp <- call_binding("strptime", x, format, unit = "s")
+        return(build_expr("cast", arrow_timestamp, options = cast_options(to_type = date32())))
+      } else {
+        abort("`as.Date()` without `format` is not supported in Arrow")
+      }
+
+    } else if (call_binding("is.numeric", x)) {
+      # the origin argument will be better supported once we implement temporal
+      # arithmetic (https://issues.apache.org/jira/browse/ARROW-14947)
+      # TODO revisit once the above has been sorted
+      if (!call_binding("is.integer", x)) {
+        # Arrow does not support direct casting from double to date so we have
+        # to convert to integers first - casting to int32() would error so we
+        # need to use round before casting
+        x <- call_binding("floor", x)
+        x <- build_expr("cast", x, options = (cast_options(to_type = int32())))
+      }
+      if (origin == "1970-01-01") {
+        return(build_expr("cast", x, options = cast_options(to_type = date32())))
+      } else {
+        abort("`as.Date()` with an `origin` different than '1970-01-01' is not supported in Arrow")

Review comment:
       [ARROW-15799](https://issues.apache.org/jira/browse/ARROW-15799) to revisit `as.Date()` and update the handling of the `origin` argument. 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] dragosmg edited a comment on pull request #12433: ARROW-14808 [R] Implement bindings for `lubridate::date()`

Posted by GitBox <gi...@apache.org>.
dragosmg edited a comment on pull request #12433:
URL: https://github.com/apache/arrow/pull/12433#issuecomment-1054504948


   I have created an umbrella issue ([ARROW-15805](https://issues.apache.org/jira/browse/ARROW-15805)) for the possible improvements to the `as.Date()` binding + [ARROW-15800](https://issues.apache.org/jira/browse/ARROW-15800) to implement `as_date()`.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] dragosmg commented on a change in pull request #12433: ARROW-14808 [R] Implement bindings for `lubridate::date()`

Posted by GitBox <gi...@apache.org>.
dragosmg commented on a change in pull request #12433:
URL: https://github.com/apache/arrow/pull/12433#discussion_r817574074



##########
File path: r/R/dplyr-funcs-type.R
##########
@@ -76,6 +76,60 @@ register_bindings_type_cast <- function() {
   register_binding("as.numeric", function(x) {
     Expression$create("cast", x, options = cast_options(to_type = float64()))
   })
+  register_binding("as.Date", function(x,
+                                       format = NULL,
+                                       tryFormats = "%Y-%m-%d",
+                                       origin = "1970-01-01",
+                                       tz = "UTC") {
+
+    if (call_binding("is.Date", x)) {
+      # base::as.Date() first converts to the desired timezone and then extracts
+      # the date, which is why we need to go through timestamp() first
+      return(x)
+
+    # cast from POSIXct
+    } else if (call_binding("is.POSIXct", x)) {
+      if (tz == "UTC") {
+        x <- build_expr("cast", x, options = cast_options(to_type = timestamp(timezone = tz)))
+      } else {
+        abort("`as.Date()` with a timezone different to 'UTC' is not supported in Arrow")
+      }
+
+    # cast from character
+    } else if (call_binding("is.character", x)) {
+      # this could be improved with tryFormats once strptime returns NA and we
+      # can use coalesce - https://issues.apache.org/jira/browse/ARROW-15659
+      # TODO revisit once https://issues.apache.org/jira/browse/ARROW-15659 is done
+      if (is.null(format)) {
+        if (length(tryFormats) == 1) {
+          format <- tryFormats[1]
+        } else {
+          abort("`as.Date()` with multiple `tryFormats` is not supported in Arrow yet")
+        }
+      }
+      x <- build_expr("strptime", x, options = list(format = format, unit = 0L))
+
+    # cast from numeric
+    } else if (call_binding("is.numeric", x)) {
+      # the origin argument will be better supported once we implement temporal
+      # arithmetic (https://issues.apache.org/jira/browse/ARROW-14947)
+      # TODO revisit once the above has been sorted
+      if (!call_binding("is.integer", x)) {
+        # Arrow does not support direct casting from double to date so we have
+        # to convert to integers first - casting to int32() would error so we
+        # need to use `floor` before casting. `floor` is also a bit safer than
+        # int32() with `safe = FALSE` since it doesn't switch to `ceiling` for
+        # negative numbers
+        # TODO revisit if arrow decides to support double -> date casting
+        x <- call_binding("floor", x)
+        x <- build_expr("cast", x, options = cast_options(to_type = int32()))
+      }
+      if (origin != "1970-01-01") {
+        abort("`as.Date()` with an `origin` different than '1970-01-01' is not supported in Arrow")
+      }
+    }

Review comment:
       Done. Moved all args checks at the top. 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] dragosmg commented on pull request #12433: ARROW-14808 [R] Implement bindings for `lubridate::date()`

Posted by GitBox <gi...@apache.org>.
dragosmg commented on pull request #12433:
URL: https://github.com/apache/arrow/pull/12433#issuecomment-1054504948


   I have created an umbrella issue ([ARROW-15805](https://issues.apache.org/jira/browse/ARROW-15805)) for the possible improvements to the `as.Date()` binding. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] dragosmg commented on a change in pull request #12433: ARROW-14808 [R] Implement bindings for `lubridate::date()`

Posted by GitBox <gi...@apache.org>.
dragosmg commented on a change in pull request #12433:
URL: https://github.com/apache/arrow/pull/12433#discussion_r818525339



##########
File path: r/R/dplyr-funcs-type.R
##########
@@ -76,6 +76,60 @@ register_bindings_type_cast <- function() {
   register_binding("as.numeric", function(x) {
     Expression$create("cast", x, options = cast_options(to_type = float64()))
   })
+  register_binding("as.Date", function(x,
+                                       format = NULL,
+                                       tryFormats = "%Y-%m-%d",
+                                       origin = "1970-01-01",
+                                       tz = "UTC") {
+
+    if (call_binding("is.Date", x)) {
+      # base::as.Date() first converts to the desired timezone and then extracts
+      # the date, which is why we need to go through timestamp() first
+      return(x)
+
+    # cast from POSIXct
+    } else if (call_binding("is.POSIXct", x)) {
+      if (tz == "UTC") {
+        x <- build_expr("cast", x, options = cast_options(to_type = timestamp(timezone = tz)))
+      } else {
+        abort("`as.Date()` with a timezone different to 'UTC' is not supported in Arrow")
+      }
+
+    # cast from character
+    } else if (call_binding("is.character", x)) {
+      # this could be improved with tryFormats once strptime returns NA and we
+      # can use coalesce - https://issues.apache.org/jira/browse/ARROW-15659
+      # TODO revisit once https://issues.apache.org/jira/browse/ARROW-15659 is done
+      if (is.null(format)) {
+        if (length(tryFormats) == 1) {
+          format <- tryFormats[1]
+        } else {
+          abort("`as.Date()` with multiple `tryFormats` is not supported in Arrow yet")
+        }
+      }
+      x <- build_expr("strptime", x, options = list(format = format, unit = 0L))
+
+    # cast from numeric
+    } else if (call_binding("is.numeric", x)) {
+      # the origin argument will be better supported once we implement temporal
+      # arithmetic (https://issues.apache.org/jira/browse/ARROW-14947)
+      # TODO revisit once the above has been sorted
+      if (!call_binding("is.integer", x)) {
+        # Arrow does not support direct casting from double to date so we have
+        # to convert to integers first - casting to int32() would error so we
+        # need to use `floor` before casting. `floor` is also a bit safer than
+        # int32() with `safe = FALSE` since it doesn't switch to `ceiling` for
+        # negative numbers
+        # TODO revisit if arrow decides to support double -> date casting
+        x <- call_binding("floor", x)
+        x <- build_expr("cast", x, options = cast_options(to_type = int32()))
+      }
+      if (origin != "1970-01-01") {
+        abort("`as.Date()` with an `origin` different than '1970-01-01' is not supported in Arrow")
+      }
+    }

Review comment:
       I think you're right. However, base R doesn't complain if you provide arguments that are not necessary. Most are passed around via `...`.
   ``` r
   as.Date("1981-01-01", origin = "1970-01-02")
   #> [1] "1981-01-01"
   ```
   
   <sup>Created on 2022-03-03 by the [reprex package](https://reprex.tidyverse.org) (v2.0.1)</sup>




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] dragosmg commented on a change in pull request #12433: ARROW-14808 [R] Implement bindings for `lubridate::date()`

Posted by GitBox <gi...@apache.org>.
dragosmg commented on a change in pull request #12433:
URL: https://github.com/apache/arrow/pull/12433#discussion_r818576865



##########
File path: r/tests/testthat/test-dplyr-funcs-type.R
##########
@@ -768,3 +769,138 @@ test_that("nested structs can be created from scalars and existing data frames",
     tibble(a = 1:2)
   )
 })
+
+test_that("as.Date() converts successfully from date, timestamp, integer, char and double", {
+  test_df <- tibble::tibble(
+    posixct_var = as.POSIXct("2022-02-25 00:00:01", tz = "Europe/London"),
+    date_var = as.Date("2022-02-25"),
+    character_ymd_var = "2022-02-25 00:00:01",
+    character_ydm_var = "2022/25/02 00:00:01",
+    integer_var = 32L,
+    double_var = 34.56
+  )
+
+  # casting from POSIXct treated separately so we can skip on Windows
+  # TODO move the test for casting from POSIXct below once ARROW-13168 is done
+  compare_dplyr_binding(
+    .input %>%
+      mutate(
+        date_dv = as.Date(date_var),
+        date_char_ymd = as.Date(character_ymd_var, format = "%Y-%m-%d %H:%M:%S"),
+        date_char_ydm = as.Date(character_ydm_var, format = "%Y/%d/%m %H:%M:%S"),
+        date_int = as.Date(integer_var, origin = "1970-01-01")
+      ) %>%
+      collect(),
+    test_df
+  )
+
+  # the way we go about it is a bit different, but the result is the same =>
+  # testing without compare_dplyr_binding()
+  expect_equal(
+    test_df %>%
+      arrow_table() %>%
+      mutate(date_double = as.Date(double_var)) %>%
+      collect(),
+    test_df %>%
+      arrow_table() %>%
+      mutate(date_double = as.Date(double_var, origin = "1970-01-01")) %>%
+      collect()
+  )
+
+  expect_equal(
+    test_df %>%
+      record_batch() %>%
+      mutate(date_double = as.Date(double_var)) %>%
+      collect(),
+    test_df %>%
+      arrow_table() %>%
+      mutate(date_double = as.Date(double_var, origin = "1970-01-01")) %>%
+      collect()
+  )
+
+  # actual and expected differ due to doubles are accounted for (floored in
+  # arrow and rounded to the next decimal in R)
+  expect_error(
+    compare_dplyr_binding(
+      .input %>%
+        mutate(date_double = as.Date(double_var, origin = "1970-01-01")) %>%
+        collect(),
+      test_df
+    )
+  )

Review comment:
       If we go via `date64()` the roundtrip gets us a date-time back.
   ``` r
   library(dplyr, warn.conflicts = FALSE)
   library(arrow, warn.conflicts = FALSE)
   #> See arrow_info() for available features
   
   a <- Array$create(34.56)
   (a*86400*1000)$cast(int64())$cast(date64())
   #> Array
   #> <date64[ms]>
   #> [
   #>   1970-02-04
   #> ]
   
   df <- tibble::tibble(
     a = 34.56
   )
   
   df %>% 
     mutate(b = as.Date(a, origin = "1970-01-01"))
   #> # A tibble: 1 × 2
   #>       a b         
   #>   <dbl> <date>    
   #> 1  34.6 1970-02-04
   
   df %>% 
     arrow_table() %>% 
     mutate(b = as.Date(a, origin = "1970-01-01")) %>% 
     collect()
   #> # A tibble: 1 × 2
   #>       a b         
   #>   <dbl> <dttm>    
   #> 1  34.6 1970-02-04 14:26:24
   ```
   
   <sup>Created on 2022-03-03 by the [reprex package](https://reprex.tidyverse.org) (v2.0.1)</sup>




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jonkeane commented on a change in pull request #12433: ARROW-14808 [R] Implement bindings for `lubridate::date()`

Posted by GitBox <gi...@apache.org>.
jonkeane commented on a change in pull request #12433:
URL: https://github.com/apache/arrow/pull/12433#discussion_r817909766



##########
File path: r/R/dplyr-funcs-type.R
##########
@@ -76,6 +76,60 @@ register_bindings_type_cast <- function() {
   register_binding("as.numeric", function(x) {
     Expression$create("cast", x, options = cast_options(to_type = float64()))
   })
+  register_binding("as.Date", function(x,
+                                       format = NULL,
+                                       tryFormats = "%Y-%m-%d",
+                                       origin = "1970-01-01",
+                                       tz = "UTC") {
+
+    if (call_binding("is.Date", x)) {
+      # base::as.Date() first converts to the desired timezone and then extracts
+      # the date, which is why we need to go through timestamp() first
+      return(x)
+
+    # cast from POSIXct
+    } else if (call_binding("is.POSIXct", x)) {
+      if (tz == "UTC") {
+        x <- build_expr("cast", x, options = cast_options(to_type = timestamp(timezone = tz)))
+      } else {
+        abort("`as.Date()` with a timezone different to 'UTC' is not supported in Arrow")
+      }
+
+    # cast from character
+    } else if (call_binding("is.character", x)) {
+      # this could be improved with tryFormats once strptime returns NA and we
+      # can use coalesce - https://issues.apache.org/jira/browse/ARROW-15659
+      # TODO revisit once https://issues.apache.org/jira/browse/ARROW-15659 is done
+      if (is.null(format)) {
+        if (length(tryFormats) == 1) {
+          format <- tryFormats[1]
+        } else {
+          abort("`as.Date()` with multiple `tryFormats` is not supported in Arrow yet")
+        }
+      }
+      x <- build_expr("strptime", x, options = list(format = format, unit = 0L))
+
+    # cast from numeric
+    } else if (call_binding("is.numeric", x)) {
+      # the origin argument will be better supported once we implement temporal
+      # arithmetic (https://issues.apache.org/jira/browse/ARROW-14947)
+      # TODO revisit once the above has been sorted
+      if (!call_binding("is.integer", x)) {
+        # Arrow does not support direct casting from double to date so we have
+        # to convert to integers first - casting to int32() would error so we
+        # need to use `floor` before casting. `floor` is also a bit safer than
+        # int32() with `safe = FALSE` since it doesn't switch to `ceiling` for
+        # negative numbers
+        # TODO revisit if arrow decides to support double -> date casting
+        x <- call_binding("floor", x)
+        x <- build_expr("cast", x, options = cast_options(to_type = int32()))
+      }
+      if (origin != "1970-01-01") {
+        abort("`as.Date()` with an `origin` different than '1970-01-01' is not supported in Arrow")
+      }
+    }

Review comment:
       Like I mentioned above, I honestly hadn't caught that this was only happening with numerics. We could put this at the top of the numeric if clause, that would be fine (or we could have a two-condition if at the top to check it). Though if someone _does_ provide an origin for anything other than numerics we probably should error since that's not going to do anything for them. 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] dragosmg commented on a change in pull request #12433: ARROW-14808 [R] Implement bindings for `lubridate::date()`

Posted by GitBox <gi...@apache.org>.
dragosmg commented on a change in pull request #12433:
URL: https://github.com/apache/arrow/pull/12433#discussion_r817549370



##########
File path: r/R/dplyr-funcs-type.R
##########
@@ -76,6 +76,60 @@ register_bindings_type_cast <- function() {
   register_binding("as.numeric", function(x) {
     Expression$create("cast", x, options = cast_options(to_type = float64()))
   })
+  register_binding("as.Date", function(x,
+                                       format = NULL,
+                                       tryFormats = "%Y-%m-%d",
+                                       origin = "1970-01-01",
+                                       tz = "UTC") {
+
+    if (call_binding("is.Date", x)) {
+      # base::as.Date() first converts to the desired timezone and then extracts
+      # the date, which is why we need to go through timestamp() first
+      return(x)
+
+    # cast from POSIXct
+    } else if (call_binding("is.POSIXct", x)) {
+      if (tz == "UTC") {
+        x <- build_expr("cast", x, options = cast_options(to_type = timestamp(timezone = tz)))
+      } else {
+        abort("`as.Date()` with a timezone different to 'UTC' is not supported in Arrow")
+      }
+
+    # cast from character
+    } else if (call_binding("is.character", x)) {
+      # this could be improved with tryFormats once strptime returns NA and we
+      # can use coalesce - https://issues.apache.org/jira/browse/ARROW-15659
+      # TODO revisit once https://issues.apache.org/jira/browse/ARROW-15659 is done
+      if (is.null(format)) {
+        if (length(tryFormats) == 1) {
+          format <- tryFormats[1]
+        } else {
+          abort("`as.Date()` with multiple `tryFormats` is not supported in Arrow yet")
+        }
+      }
+      x <- build_expr("strptime", x, options = list(format = format, unit = 0L))
+
+    # cast from numeric
+    } else if (call_binding("is.numeric", x)) {
+      # the origin argument will be better supported once we implement temporal
+      # arithmetic (https://issues.apache.org/jira/browse/ARROW-14947)
+      # TODO revisit once the above has been sorted
+      if (!call_binding("is.integer", x)) {
+        # Arrow does not support direct casting from double to date so we have
+        # to convert to integers first - casting to int32() would error so we
+        # need to use `floor` before casting. `floor` is also a bit safer than
+        # int32() with `safe = FALSE` since it doesn't switch to `ceiling` for
+        # negative numbers
+        # TODO revisit if arrow decides to support double -> date casting
+        x <- call_binding("floor", x)
+        x <- build_expr("cast", x, options = cast_options(to_type = int32()))
+      }
+      if (origin != "1970-01-01") {
+        abort("`as.Date()` with an `origin` different than '1970-01-01' is not supported in Arrow")
+      }
+    }

Review comment:
       While I generally agree that argument validation should happen at the top of a function. I'm on the fence on this one. The origin argument is only relevant when trying to coerce a numeric, so I'm a bit reluctant to move it outside the `numeric` block. In regular R definitions, each block corresponding to a type would be a distinct method. Method-specific arguments would be checked on a per-method basis, wouldn't they?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jonkeane commented on a change in pull request #12433: ARROW-14808 [R] Implement bindings for `lubridate::date()`

Posted by GitBox <gi...@apache.org>.
jonkeane commented on a change in pull request #12433:
URL: https://github.com/apache/arrow/pull/12433#discussion_r816352943



##########
File path: r/tests/testthat/test-dplyr-funcs-datetime.R
##########
@@ -801,3 +801,94 @@ test_that("dst extracts daylight savings time correctly", {
     test_df
   )
 })
+
+test_that("date works in arrow", {
+  # https://issues.apache.org/jira/browse/ARROW-13168
+  skip_on_os("windows")
+  # this date is specific since lubridate::date() is different from base::as.Date()
+  # since as.Date returns the UTC date and date() doesn't
+  test_df <- tibble(
+    a = as.POSIXct(c("2012-03-26 23:12:13", NA), tz = "America/New_York"))
+  r_date_object <- lubridate::ymd_hms("2012-03-26 23:12:13")
+
+  # we can't (for now) use namespacing, so we need to make sure lubridate::date()
+  # and not base::date() is being used
+  # TODO remove once https://issues.apache.org/jira/browse/ARROW-14575 is done

Review comment:
       We also might want to add that this is only due to the way that testthat runs – that any normal use of arrow will not have to do this explicitly.

##########
File path: r/R/dplyr-funcs-type.R
##########
@@ -76,6 +76,64 @@ register_bindings_type_cast <- function() {
   register_binding("as.numeric", function(x) {
     Expression$create("cast", x, options = cast_options(to_type = float64()))
   })
+  register_binding("as.Date", function(x,
+                                       format = NULL,
+                                       tryFormats = "%Y-%m-%d",
+                                       origin = "1970-01-01",
+                                       tz = "UTC") {
+
+    if (call_binding("is.Date", x)) {
+      # base::as.Date() first converts to the desired timezone and then extracts
+      # the date, which is why we need to go through timestamp() first
+      return(x)
+
+    # cast from POSIXct
+    } else if (call_binding("is.POSIXct", x)) {
+      if (tz == "UTC") {
+        interim_x <- build_expr("cast", x, options = cast_options(to_type = timestamp(timezone = tz)))
+      } else {
+        abort("`as.Date()` with a timezone different to 'UTC' is not supported in Arrow")
+      }
+
+    # cast from character
+    } else if (call_binding("is.character", x)) {
+      # this could be improved with tryFormats once strptime returns NA and we
+      # can use coalesce - https://issues.apache.org/jira/browse/ARROW-15659
+      # TODO revisit once https://issues.apache.org/jira/browse/ARROW-15659 is done
+      if (is.null(format)) {
+        if (length(tryFormats) == 1) {
+          format <- tryFormats[1]
+        } else {
+          abort("`as.Date()` with multiple `tryFormats` is not supported in Arrow yet")
+        }
+      }
+      # if x is not an expression (e.g. passed as filter), convert it to one
+      if (!inherits(x, "Expression")) {
+        x <- build_expr("cast", x, options = cast_options(to_type = type(x)))
+      }
+      interim_x <- call_binding("strptime", x, format, unit = "s")
+
+    # cast from numeric
+    } else if (call_binding("is.numeric", x)) {
+      # the origin argument will be better supported once we implement temporal
+      # arithmetic (https://issues.apache.org/jira/browse/ARROW-14947)
+      # TODO revisit once the above has been sorted
+      if (!call_binding("is.integer", x)) {
+        # Arrow does not support direct casting from double to date so we have
+        # to convert to integers first - casting to int32() would error so we
+        # need to use round before casting
+        # TODO revisit if arrow decides to support double -> date casting
+        x <- call_binding("floor", x)
+        interim_x <- build_expr("cast", x, options = (cast_options(to_type = int32())))
+      } else {
+        interim_x <- x
+      }

Review comment:
       Is it possible to overwrite `x` instead of using a separate object? That would let us get rid of this `else` clause, at least. Which might make this set of nested if/elses a bit easier to parse

##########
File path: r/R/dplyr-funcs-type.R
##########
@@ -76,6 +76,64 @@ register_bindings_type_cast <- function() {
   register_binding("as.numeric", function(x) {
     Expression$create("cast", x, options = cast_options(to_type = float64()))
   })
+  register_binding("as.Date", function(x,
+                                       format = NULL,
+                                       tryFormats = "%Y-%m-%d",
+                                       origin = "1970-01-01",
+                                       tz = "UTC") {
+
+    if (call_binding("is.Date", x)) {
+      # base::as.Date() first converts to the desired timezone and then extracts
+      # the date, which is why we need to go through timestamp() first
+      return(x)
+
+    # cast from POSIXct
+    } else if (call_binding("is.POSIXct", x)) {
+      if (tz == "UTC") {
+        interim_x <- build_expr("cast", x, options = cast_options(to_type = timestamp(timezone = tz)))
+      } else {
+        abort("`as.Date()` with a timezone different to 'UTC' is not supported in Arrow")
+      }
+
+    # cast from character
+    } else if (call_binding("is.character", x)) {
+      # this could be improved with tryFormats once strptime returns NA and we
+      # can use coalesce - https://issues.apache.org/jira/browse/ARROW-15659
+      # TODO revisit once https://issues.apache.org/jira/browse/ARROW-15659 is done
+      if (is.null(format)) {
+        if (length(tryFormats) == 1) {
+          format <- tryFormats[1]
+        } else {
+          abort("`as.Date()` with multiple `tryFormats` is not supported in Arrow yet")
+        }
+      }
+      # if x is not an expression (e.g. passed as filter), convert it to one
+      if (!inherits(x, "Expression")) {
+        x <- build_expr("cast", x, options = cast_options(to_type = type(x)))
+      }

Review comment:
       Could you describe a little bit more about what cases this is catching? 
   
   I ask because I suspect that this is catching with the message we are testing:
   
   ```
   expect_error(
       test_df %>%
         arrow_table() %>%
         mutate(date_char = date(character_ymd_var)) %>%
         collect(),
       regexp = "Unsupported cast from string to date32 using function cast_date32"
     )
   ```
   
   I was expecting something like this: 
   
   ```
   > tab <- arrow_table(data.frame(time = "2020-01-01 00:01:00", date = "2020-01-01"))
   > tab %>% mutate(strptime(time, format = "%Y-%m-%d")) %>% collect()
   Error in `handle_csv_read_error()`:
   ! Invalid: Failed to parse string: '2020-01-01 00:01:00' as a scalar of type timestamp[ms]
   Run `rlang::last_error()` to see where the error occurred.
   ```
   
   So I suspect that this `if()` is catching this before that, which I'm a little surprised by. So anyway, what cases are we expecting here?

##########
File path: r/tests/testthat/test-dplyr-funcs-datetime.R
##########
@@ -801,3 +801,94 @@ test_that("dst extracts daylight savings time correctly", {
     test_df
   )
 })
+
+test_that("date works in arrow", {
+  # https://issues.apache.org/jira/browse/ARROW-13168
+  skip_on_os("windows")
+  # this date is specific since lubridate::date() is different from base::as.Date()
+  # since as.Date returns the UTC date and date() doesn't
+  test_df <- tibble(
+    a = as.POSIXct(c("2012-03-26 23:12:13", NA), tz = "America/New_York"))
+  r_date_object <- lubridate::ymd_hms("2012-03-26 23:12:13")
+
+  # we can't (for now) use namespacing, so we need to make sure lubridate::date()
+  # and not base::date() is being used
+  # TODO remove once https://issues.apache.org/jira/browse/ARROW-14575 is done
+  date <- lubridate::date
+  compare_dplyr_binding(
+    .input %>%
+      mutate(a_date = date(a)) %>%
+      collect(),
+    test_df
+  )
+
+  compare_dplyr_binding(
+    .input %>%
+      mutate(a_date_base = as.Date(a)) %>%
+      collect(),
+    test_df
+  )
+
+  compare_dplyr_binding(
+    .input %>%
+      mutate(b = date(r_date_object)) %>%
+      collect(),
+    test_df
+  )
+
+  compare_dplyr_binding(
+    .input %>%
+      mutate(b_base = as.Date(r_date_object)) %>%
+      collect(),
+    test_df
+  )
+})
+
+test_that("date() errors with unsupported inputs", {
+  test_df <- tibble::tibble(
+    posixct_var = as.POSIXct("2022-02-25 00:00:01", tz = "Europe/London"),
+    date_var = as.Date("2022-02-25"),
+    character_ymd_var = "2022-02-25 00:00:01",
+    character_ydm_var = "2022/25/02 00:00:01",
+    integer_var = 32L,
+    double_var = 34.56,
+    boolean_var = TRUE
+  )
+
+  # date from integer supported in arrow (similar to base::as.Date()), but in
+  # Arrow it assumes a fixed origin "1970-01-01"
+  expect_equal(
+    test_df %>%
+      arrow_table() %>%
+      select(integer_var) %>%
+      mutate(date_int = date(integer_var)) %>%
+      collect(),
+    tibble(integer_var = 32L,
+           date_int = as.Date("1970-02-02"))
+  )

Review comment:
       This is minor, but should this be in this block? It seems a bit odd to have a block labeled "date() errors with unsupported inputs" that has an `expect_equal()` without any error in it

##########
File path: r/tests/testthat/test-dplyr-funcs-datetime.R
##########
@@ -801,3 +801,94 @@ test_that("dst extracts daylight savings time correctly", {
     test_df
   )
 })
+
+test_that("date works in arrow", {
+  # https://issues.apache.org/jira/browse/ARROW-13168
+  skip_on_os("windows")
+  # this date is specific since lubridate::date() is different from base::as.Date()
+  # since as.Date returns the UTC date and date() doesn't
+  test_df <- tibble(
+    a = as.POSIXct(c("2012-03-26 23:12:13", NA), tz = "America/New_York"))

Review comment:
       I know we don't have to differentiate things here, but would it be possible to name this something more descriptive than `a` regardless? 

##########
File path: r/R/dplyr-funcs-type.R
##########
@@ -76,6 +76,64 @@ register_bindings_type_cast <- function() {
   register_binding("as.numeric", function(x) {
     Expression$create("cast", x, options = cast_options(to_type = float64()))
   })
+  register_binding("as.Date", function(x,
+                                       format = NULL,
+                                       tryFormats = "%Y-%m-%d",
+                                       origin = "1970-01-01",
+                                       tz = "UTC") {
+
+    if (call_binding("is.Date", x)) {
+      # base::as.Date() first converts to the desired timezone and then extracts
+      # the date, which is why we need to go through timestamp() first
+      return(x)
+
+    # cast from POSIXct
+    } else if (call_binding("is.POSIXct", x)) {
+      if (tz == "UTC") {
+        interim_x <- build_expr("cast", x, options = cast_options(to_type = timestamp(timezone = tz)))
+      } else {
+        abort("`as.Date()` with a timezone different to 'UTC' is not supported in Arrow")
+      }
+
+    # cast from character
+    } else if (call_binding("is.character", x)) {
+      # this could be improved with tryFormats once strptime returns NA and we
+      # can use coalesce - https://issues.apache.org/jira/browse/ARROW-15659
+      # TODO revisit once https://issues.apache.org/jira/browse/ARROW-15659 is done
+      if (is.null(format)) {
+        if (length(tryFormats) == 1) {
+          format <- tryFormats[1]
+        } else {
+          abort("`as.Date()` with multiple `tryFormats` is not supported in Arrow yet")
+        }
+      }
+      # if x is not an expression (e.g. passed as filter), convert it to one
+      if (!inherits(x, "Expression")) {
+        x <- build_expr("cast", x, options = cast_options(to_type = type(x)))
+      }
+      interim_x <- call_binding("strptime", x, format, unit = "s")
+
+    # cast from numeric
+    } else if (call_binding("is.numeric", x)) {
+      # the origin argument will be better supported once we implement temporal
+      # arithmetic (https://issues.apache.org/jira/browse/ARROW-14947)
+      # TODO revisit once the above has been sorted
+      if (!call_binding("is.integer", x)) {
+        # Arrow does not support direct casting from double to date so we have
+        # to convert to integers first - casting to int32() would error so we
+        # need to use round before casting

Review comment:
       Using `floor` here is fine, but you should dig into how|why you get the error you're talking about here — there is a way around it if you wanted to do that!

##########
File path: r/R/dplyr-funcs-type.R
##########
@@ -76,6 +76,64 @@ register_bindings_type_cast <- function() {
   register_binding("as.numeric", function(x) {
     Expression$create("cast", x, options = cast_options(to_type = float64()))
   })
+  register_binding("as.Date", function(x,
+                                       format = NULL,
+                                       tryFormats = "%Y-%m-%d",
+                                       origin = "1970-01-01",
+                                       tz = "UTC") {
+
+    if (call_binding("is.Date", x)) {
+      # base::as.Date() first converts to the desired timezone and then extracts
+      # the date, which is why we need to go through timestamp() first
+      return(x)
+
+    # cast from POSIXct
+    } else if (call_binding("is.POSIXct", x)) {
+      if (tz == "UTC") {
+        interim_x <- build_expr("cast", x, options = cast_options(to_type = timestamp(timezone = tz)))
+      } else {
+        abort("`as.Date()` with a timezone different to 'UTC' is not supported in Arrow")
+      }
+
+    # cast from character
+    } else if (call_binding("is.character", x)) {
+      # this could be improved with tryFormats once strptime returns NA and we
+      # can use coalesce - https://issues.apache.org/jira/browse/ARROW-15659
+      # TODO revisit once https://issues.apache.org/jira/browse/ARROW-15659 is done
+      if (is.null(format)) {
+        if (length(tryFormats) == 1) {
+          format <- tryFormats[1]
+        } else {
+          abort("`as.Date()` with multiple `tryFormats` is not supported in Arrow yet")
+        }
+      }
+      # if x is not an expression (e.g. passed as filter), convert it to one
+      if (!inherits(x, "Expression")) {
+        x <- build_expr("cast", x, options = cast_options(to_type = type(x)))
+      }

Review comment:
       And separately — we should make a Jira if one doesn't already exist to see if we can allow `strptime(time, format = "%Y-%m-%d")` to work (which does in R)
   
   ```
   > strptime("2020-01-01 00:01:00", "%Y-%m-%d")
   [1] "2020-01-01 CST"
   ```




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] dragosmg commented on a change in pull request #12433: ARROW-14808 [R] Implement bindings for `lubridate::date()`

Posted by GitBox <gi...@apache.org>.
dragosmg commented on a change in pull request #12433:
URL: https://github.com/apache/arrow/pull/12433#discussion_r815951089



##########
File path: r/R/dplyr-funcs-type.R
##########
@@ -76,6 +76,53 @@ register_bindings_type_cast <- function() {
   register_binding("as.numeric", function(x) {
     Expression$create("cast", x, options = cast_options(to_type = float64()))
   })
+  register_binding("as.Date", function(x,
+                                       format = NULL,
+                                       origin = "1970-01-01",
+                                       tz = "UTC") {
+    # base::as.Date() first converts to UTC and then extracts the date, which is
+    # why we need to go through timestamp() first - see unit tests for the real
+    # life impact of the difference between lubridate::date() and base::as.Date()
+    # browser()

Review comment:
       Done in a prev commit.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] dragosmg commented on a change in pull request #12433: ARROW-14808 [R] Implement bindings for `lubridate::date()`

Posted by GitBox <gi...@apache.org>.
dragosmg commented on a change in pull request #12433:
URL: https://github.com/apache/arrow/pull/12433#discussion_r816657917



##########
File path: r/R/dplyr-funcs-type.R
##########
@@ -76,6 +76,64 @@ register_bindings_type_cast <- function() {
   register_binding("as.numeric", function(x) {
     Expression$create("cast", x, options = cast_options(to_type = float64()))
   })
+  register_binding("as.Date", function(x,
+                                       format = NULL,
+                                       tryFormats = "%Y-%m-%d",
+                                       origin = "1970-01-01",
+                                       tz = "UTC") {
+
+    if (call_binding("is.Date", x)) {
+      # base::as.Date() first converts to the desired timezone and then extracts
+      # the date, which is why we need to go through timestamp() first
+      return(x)
+
+    # cast from POSIXct
+    } else if (call_binding("is.POSIXct", x)) {
+      if (tz == "UTC") {
+        interim_x <- build_expr("cast", x, options = cast_options(to_type = timestamp(timezone = tz)))
+      } else {
+        abort("`as.Date()` with a timezone different to 'UTC' is not supported in Arrow")
+      }
+
+    # cast from character
+    } else if (call_binding("is.character", x)) {
+      # this could be improved with tryFormats once strptime returns NA and we
+      # can use coalesce - https://issues.apache.org/jira/browse/ARROW-15659
+      # TODO revisit once https://issues.apache.org/jira/browse/ARROW-15659 is done
+      if (is.null(format)) {
+        if (length(tryFormats) == 1) {
+          format <- tryFormats[1]
+        } else {
+          abort("`as.Date()` with multiple `tryFormats` is not supported in Arrow yet")
+        }
+      }
+      # if x is not an expression (e.g. passed as filter), convert it to one
+      if (!inherits(x, "Expression")) {
+        x <- build_expr("cast", x, options = cast_options(to_type = type(x)))
+      }

Review comment:
       > And separately — we should make a Jira if one doesn't already exist to see if we can allow `strptime(time, format = "%Y-%m-%d")` to work (which does in R)
   > 
   > ```
   > > strptime("2020-01-01 00:01:00", "%Y-%m-%d")
   > [1] "2020-01-01 CST"
   > ```
   
   Done. [ARROW-15813](https://issues.apache.org/jira/browse/ARROW-15813)




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] dragosmg commented on a change in pull request #12433: ARROW-14808 [R] Implement bindings for `lubridate::date()`

Posted by GitBox <gi...@apache.org>.
dragosmg commented on a change in pull request #12433:
URL: https://github.com/apache/arrow/pull/12433#discussion_r816672986



##########
File path: r/R/dplyr-funcs-type.R
##########
@@ -76,6 +76,64 @@ register_bindings_type_cast <- function() {
   register_binding("as.numeric", function(x) {
     Expression$create("cast", x, options = cast_options(to_type = float64()))
   })
+  register_binding("as.Date", function(x,
+                                       format = NULL,
+                                       tryFormats = "%Y-%m-%d",
+                                       origin = "1970-01-01",
+                                       tz = "UTC") {
+
+    if (call_binding("is.Date", x)) {
+      # base::as.Date() first converts to the desired timezone and then extracts
+      # the date, which is why we need to go through timestamp() first
+      return(x)
+
+    # cast from POSIXct
+    } else if (call_binding("is.POSIXct", x)) {
+      if (tz == "UTC") {
+        interim_x <- build_expr("cast", x, options = cast_options(to_type = timestamp(timezone = tz)))
+      } else {
+        abort("`as.Date()` with a timezone different to 'UTC' is not supported in Arrow")
+      }
+
+    # cast from character
+    } else if (call_binding("is.character", x)) {
+      # this could be improved with tryFormats once strptime returns NA and we
+      # can use coalesce - https://issues.apache.org/jira/browse/ARROW-15659
+      # TODO revisit once https://issues.apache.org/jira/browse/ARROW-15659 is done
+      if (is.null(format)) {
+        if (length(tryFormats) == 1) {
+          format <- tryFormats[1]
+        } else {
+          abort("`as.Date()` with multiple `tryFormats` is not supported in Arrow yet")
+        }
+      }
+      # if x is not an expression (e.g. passed as filter), convert it to one
+      if (!inherits(x, "Expression")) {
+        x <- build_expr("cast", x, options = cast_options(to_type = type(x)))
+      }
+      interim_x <- call_binding("strptime", x, format, unit = "s")
+
+    # cast from numeric
+    } else if (call_binding("is.numeric", x)) {
+      # the origin argument will be better supported once we implement temporal
+      # arithmetic (https://issues.apache.org/jira/browse/ARROW-14947)
+      # TODO revisit once the above has been sorted
+      if (!call_binding("is.integer", x)) {
+        # Arrow does not support direct casting from double to date so we have
+        # to convert to integers first - casting to int32() would error so we
+        # need to use round before casting

Review comment:
       Is `safe = FALSE` what you had in mind?
   ``` r
   suppressPackageStartupMessages(library(arrow))
   a <- Array$create(32.45)
   a$cast(int32())
   #> Error: Invalid: Float value 32.45 was truncated converting to int32
   #> /Users/dragos/Documents/arrow/cpp/src/arrow/compute/kernels/scalar_cast_numeric.cc:177  CheckFloatToIntTruncation(batch[0], *out)
   #> /Users/dragos/Documents/arrow/cpp/src/arrow/compute/exec.cc:700  kernel_->exec(kernel_ctx_, batch, &out)
   #> /Users/dragos/Documents/arrow/cpp/src/arrow/compute/exec.cc:641  ExecuteBatch(batch, listener)
   #> /Users/dragos/Documents/arrow/cpp/src/arrow/compute/function.cc:255  executor->Execute(implicitly_cast_args, &listener)
   
   a$cast(int32(), safe = FALSE)
   #> Array
   #> <int32>
   #> [
   #>   32
   #> ]
   ```
   
   <sup>Created on 2022-03-01 by the [reprex package](https://reprex.tidyverse.org) (v2.0.1)</sup>
   
   Ideally we would support float -> date casting ([ARROW-15807](https://issues.apache.org/jira/browse/ARROW-15807) 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] ursabot commented on pull request #12433: ARROW-14808 [R] Implement bindings for `lubridate::date()`

Posted by GitBox <gi...@apache.org>.
ursabot commented on pull request #12433:
URL: https://github.com/apache/arrow/pull/12433#issuecomment-1058320886


   Benchmark runs are scheduled for baseline = ce46c1adf10654248b02c349210bb63ecc6828ad and contender = 9719eae66dcf38c966ae769215d27020a6dd5550. 9719eae66dcf38c966ae769215d27020a6dd5550 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
   Conbench compare runs links:
   [Scheduled] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/f23d468014004a348cc275618eb8dbd7...0caa28678d194dc0b298bad2a2b53ca3/)
   [Scheduled] [test-mac-arm](https://conbench.ursa.dev/compare/runs/b4207e561b6447d984f6e26c65ee020d...f85c32d499a944df8a229a51f0aa8efb/)
   [Scheduled] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/2dd547de0ff44aeab70df443df281742...5a56a8e391904fa38b23a33f43ab61d6/)
   [Scheduled] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/1595f8e1d2c84a83928b04b87805d46c...8ea69c725c34424e867ffbbafe2533e4/)
   Supported benchmarks:
   ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python. Runs only benchmarks with cloud = True
   test-mac-arm: Supported benchmark langs: C++, Python, R
   ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
   ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] dragosmg commented on a change in pull request #12433: ARROW-14808 [R] Implement bindings for `lubridate::date()`

Posted by GitBox <gi...@apache.org>.
dragosmg commented on a change in pull request #12433:
URL: https://github.com/apache/arrow/pull/12433#discussion_r818787400



##########
File path: r/tests/testthat/test-dplyr-funcs-type.R
##########
@@ -767,4 +768,81 @@ test_that("nested structs can be created from scalars and existing data frames",
       collect(),
     tibble(a = 1:2)
   )
+
+  })
+
+test_that("as.Date() converts successfully from date, timestamp, integer, char and double", {
+  test_df <- tibble::tibble(
+    posixct_var = as.POSIXct("2022-02-25 00:00:01", tz = "Europe/London"),
+    date_var = as.Date("2022-02-25"),
+    character_ymd_var = "2022-02-25 00:00:01",
+    character_ydm_var = "2022/25/02 00:00:01",
+    integer_var = 32L,
+    double_var = 34.56
+  )
+
+  # casting from POSIXct treated separately so we can skip on Windows
+  # TODO move the test for casting from POSIXct below once ARROW-13168 is done
+  compare_dplyr_binding(
+    .input %>%
+      mutate(
+        date_dv = as.Date(date_var),
+        date_char_ymd = as.Date(character_ymd_var, format = "%Y-%m-%d %H:%M:%S"),
+        date_char_ydm = as.Date(character_ydm_var, format = "%Y/%d/%m %H:%M:%S"),
+        date_int = as.Date(integer_var, origin = "1970-01-01")
+      ) %>%
+      collect(),
+    test_df
+  )
+
+  # currently we do not support an origin different to "1970-01-01"
+  expect_warning(
+    test_df %>%
+      arrow_table() %>%
+      mutate(date_int = as.Date(integer_var, origin = "1970-01-03")) %>%
+      collect(),
+    regexp = "`as.Date()` with an `origin` different than '1970-01-01' is not supported in Arrow",
+    fixed = TRUE
+  )

Review comment:
       Done. 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] ursabot edited a comment on pull request #12433: ARROW-14808 [R] Implement bindings for `lubridate::date()`

Posted by GitBox <gi...@apache.org>.
ursabot edited a comment on pull request #12433:
URL: https://github.com/apache/arrow/pull/12433#issuecomment-1058320886


   Benchmark runs are scheduled for baseline = ce46c1adf10654248b02c349210bb63ecc6828ad and contender = 9719eae66dcf38c966ae769215d27020a6dd5550. 9719eae66dcf38c966ae769215d27020a6dd5550 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
   Conbench compare runs links:
   [Finished :arrow_down:0.0% :arrow_up:0.0%] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/f23d468014004a348cc275618eb8dbd7...0caa28678d194dc0b298bad2a2b53ca3/)
   [Scheduled] [test-mac-arm](https://conbench.ursa.dev/compare/runs/b4207e561b6447d984f6e26c65ee020d...f85c32d499a944df8a229a51f0aa8efb/)
   [Scheduled] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/2dd547de0ff44aeab70df443df281742...5a56a8e391904fa38b23a33f43ab61d6/)
   [Scheduled] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/1595f8e1d2c84a83928b04b87805d46c...8ea69c725c34424e867ffbbafe2533e4/)
   Supported benchmarks:
   ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python. Runs only benchmarks with cloud = True
   test-mac-arm: Supported benchmark langs: C++, Python, R
   ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
   ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] dragosmg commented on a change in pull request #12433: ARROW-14808 [R] Implement bindings for `lubridate::date()`

Posted by GitBox <gi...@apache.org>.
dragosmg commented on a change in pull request #12433:
URL: https://github.com/apache/arrow/pull/12433#discussion_r817549370



##########
File path: r/R/dplyr-funcs-type.R
##########
@@ -76,6 +76,60 @@ register_bindings_type_cast <- function() {
   register_binding("as.numeric", function(x) {
     Expression$create("cast", x, options = cast_options(to_type = float64()))
   })
+  register_binding("as.Date", function(x,
+                                       format = NULL,
+                                       tryFormats = "%Y-%m-%d",
+                                       origin = "1970-01-01",
+                                       tz = "UTC") {
+
+    if (call_binding("is.Date", x)) {
+      # base::as.Date() first converts to the desired timezone and then extracts
+      # the date, which is why we need to go through timestamp() first
+      return(x)
+
+    # cast from POSIXct
+    } else if (call_binding("is.POSIXct", x)) {
+      if (tz == "UTC") {
+        x <- build_expr("cast", x, options = cast_options(to_type = timestamp(timezone = tz)))
+      } else {
+        abort("`as.Date()` with a timezone different to 'UTC' is not supported in Arrow")
+      }
+
+    # cast from character
+    } else if (call_binding("is.character", x)) {
+      # this could be improved with tryFormats once strptime returns NA and we
+      # can use coalesce - https://issues.apache.org/jira/browse/ARROW-15659
+      # TODO revisit once https://issues.apache.org/jira/browse/ARROW-15659 is done
+      if (is.null(format)) {
+        if (length(tryFormats) == 1) {
+          format <- tryFormats[1]
+        } else {
+          abort("`as.Date()` with multiple `tryFormats` is not supported in Arrow yet")
+        }
+      }
+      x <- build_expr("strptime", x, options = list(format = format, unit = 0L))
+
+    # cast from numeric
+    } else if (call_binding("is.numeric", x)) {
+      # the origin argument will be better supported once we implement temporal
+      # arithmetic (https://issues.apache.org/jira/browse/ARROW-14947)
+      # TODO revisit once the above has been sorted
+      if (!call_binding("is.integer", x)) {
+        # Arrow does not support direct casting from double to date so we have
+        # to convert to integers first - casting to int32() would error so we
+        # need to use `floor` before casting. `floor` is also a bit safer than
+        # int32() with `safe = FALSE` since it doesn't switch to `ceiling` for
+        # negative numbers
+        # TODO revisit if arrow decides to support double -> date casting
+        x <- call_binding("floor", x)
+        x <- build_expr("cast", x, options = cast_options(to_type = int32()))
+      }
+      if (origin != "1970-01-01") {
+        abort("`as.Date()` with an `origin` different than '1970-01-01' is not supported in Arrow")
+      }
+    }

Review comment:
       I'm on the fence on this one. The origin argument is only relevant when trying to coerce a numeric, so I'm a bit reluctant to move it outside the `numeric` block. 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] dragosmg commented on a change in pull request #12433: ARROW-14808 [R] Implement bindings for `lubridate::date()`

Posted by GitBox <gi...@apache.org>.
dragosmg commented on a change in pull request #12433:
URL: https://github.com/apache/arrow/pull/12433#discussion_r817579897



##########
File path: r/tests/testthat/test-dplyr-funcs-type.R
##########
@@ -768,3 +769,138 @@ test_that("nested structs can be created from scalars and existing data frames",
     tibble(a = 1:2)
   )
 })
+
+test_that("as.Date() converts successfully from date, timestamp, integer, char and double", {
+  test_df <- tibble::tibble(
+    posixct_var = as.POSIXct("2022-02-25 00:00:01", tz = "Europe/London"),
+    date_var = as.Date("2022-02-25"),
+    character_ymd_var = "2022-02-25 00:00:01",
+    character_ydm_var = "2022/25/02 00:00:01",
+    integer_var = 32L,
+    double_var = 34.56
+  )
+
+  # casting from POSIXct treated separately so we can skip on Windows
+  # TODO move the test for casting from POSIXct below once ARROW-13168 is done
+  compare_dplyr_binding(
+    .input %>%
+      mutate(
+        date_dv = as.Date(date_var),
+        date_char_ymd = as.Date(character_ymd_var, format = "%Y-%m-%d %H:%M:%S"),
+        date_char_ydm = as.Date(character_ydm_var, format = "%Y/%d/%m %H:%M:%S"),
+        date_int = as.Date(integer_var, origin = "1970-01-01")
+      ) %>%
+      collect(),
+    test_df
+  )
+
+  # the way we go about it is a bit different, but the result is the same =>
+  # testing without compare_dplyr_binding()
+  expect_equal(
+    test_df %>%
+      arrow_table() %>%
+      mutate(date_double = as.Date(double_var)) %>%
+      collect(),
+    test_df %>%
+      arrow_table() %>%
+      mutate(date_double = as.Date(double_var, origin = "1970-01-01")) %>%
+      collect()
+  )
+
+  expect_equal(
+    test_df %>%
+      record_batch() %>%
+      mutate(date_double = as.Date(double_var)) %>%
+      collect(),
+    test_df %>%
+      arrow_table() %>%
+      mutate(date_double = as.Date(double_var, origin = "1970-01-01")) %>%
+      collect()
+  )
+
+  # actual and expected differ due to doubles are accounted for (floored in
+  # arrow and rounded to the next decimal in R)
+  expect_error(
+    compare_dplyr_binding(
+      .input %>%
+        mutate(date_double = as.Date(double_var, origin = "1970-01-01")) %>%
+        collect(),
+      test_df
+    )
+  )

Review comment:
       Initially I implemented it with `round()`, but this rounds up if the fractional part is greater than 0.5, which I don't think is what we want. I think `floor()` is the correct implementation.
   ``` r
   as.Date(34, origin = "1970-01-01") 
   #> [1] "1970-02-04"
   as.Date(34.56, origin = "1970-01-01") 
   #> [1] "1970-02-04"
   as.Date(round(34.56), origin = "1970-01-01")
   #> [1] "1970-02-05"
   as.Date(floor(34.56), origin = "1970-01-01")
   #> [1] "1970-02-04"
   ```
   
   <sup>Created on 2022-03-02 by the [reprex package](https://reprex.tidyverse.org) (v2.0.1)</sup>




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] dragosmg commented on a change in pull request #12433: ARROW-14808 [R] Implement bindings for `lubridate::date()`

Posted by GitBox <gi...@apache.org>.
dragosmg commented on a change in pull request #12433:
URL: https://github.com/apache/arrow/pull/12433#discussion_r814964594



##########
File path: r/R/dplyr-funcs-type.R
##########
@@ -76,6 +76,53 @@ register_bindings_type_cast <- function() {
   register_binding("as.numeric", function(x) {
     Expression$create("cast", x, options = cast_options(to_type = float64()))
   })
+  register_binding("as.Date", function(x,
+                                       format = NULL,
+                                       origin = "1970-01-01",
+                                       tz = "UTC") {
+    # base::as.Date() first converts to UTC and then extracts the date, which is
+    # why we need to go through timestamp() first - see unit tests for the real
+    # life impact of the difference between lubridate::date() and base::as.Date()
+    # browser()
+    if (call_binding("is.Date", x)) {
+      # arrow_date <- build_expr("cast", x, options = cast_options(to_type = date32()))
+      return(x)
+    } else if (call_binding("is.POSIXct", x)) {
+      if (tz == "UTC") {
+        arrow_timestamp <- build_expr("cast", x, options = cast_options(to_type = timestamp(timezone = tz)))
+        return(build_expr("cast", arrow_timestamp, options = cast_options(to_type = date32())))
+      } else {
+        abort("`as.Date()` with a timezone different to 'UTC' is not supported in Arrow")
+      }
+    } else if (call_binding("is.character", x)) {
+      # this could be improved with tryFormats once strptime returns NA and we
+      # can use coalesce - https://issues.apache.org/jira/browse/ARROW-15659
+      # TODO revisit once https://issues.apache.org/jira/browse/ARROW-15659 is done
+      if (!is.null(format)) {
+        arrow_timestamp <- call_binding("strptime", x, format, unit = "s")
+        return(build_expr("cast", arrow_timestamp, options = cast_options(to_type = date32())))
+      } else {

Review comment:
       My plan was to revisit this once the `strptime()` kernel supports returning NAs and then `coalesce()` through the `tryFormats` `base::as.Date()` defaults. In the meantime, defaulting to `"%Y-%m-%d"` sounds like a good plan. 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] dragosmg commented on a change in pull request #12433: ARROW-14808 [R] Implement bindings for `lubridate::date()`

Posted by GitBox <gi...@apache.org>.
dragosmg commented on a change in pull request #12433:
URL: https://github.com/apache/arrow/pull/12433#discussion_r814964594



##########
File path: r/R/dplyr-funcs-type.R
##########
@@ -76,6 +76,53 @@ register_bindings_type_cast <- function() {
   register_binding("as.numeric", function(x) {
     Expression$create("cast", x, options = cast_options(to_type = float64()))
   })
+  register_binding("as.Date", function(x,
+                                       format = NULL,
+                                       origin = "1970-01-01",
+                                       tz = "UTC") {
+    # base::as.Date() first converts to UTC and then extracts the date, which is
+    # why we need to go through timestamp() first - see unit tests for the real
+    # life impact of the difference between lubridate::date() and base::as.Date()
+    # browser()
+    if (call_binding("is.Date", x)) {
+      # arrow_date <- build_expr("cast", x, options = cast_options(to_type = date32()))
+      return(x)
+    } else if (call_binding("is.POSIXct", x)) {
+      if (tz == "UTC") {
+        arrow_timestamp <- build_expr("cast", x, options = cast_options(to_type = timestamp(timezone = tz)))
+        return(build_expr("cast", arrow_timestamp, options = cast_options(to_type = date32())))
+      } else {
+        abort("`as.Date()` with a timezone different to 'UTC' is not supported in Arrow")
+      }
+    } else if (call_binding("is.character", x)) {
+      # this could be improved with tryFormats once strptime returns NA and we
+      # can use coalesce - https://issues.apache.org/jira/browse/ARROW-15659
+      # TODO revisit once https://issues.apache.org/jira/browse/ARROW-15659 is done
+      if (!is.null(format)) {
+        arrow_timestamp <- call_binding("strptime", x, format, unit = "s")
+        return(build_expr("cast", arrow_timestamp, options = cast_options(to_type = date32())))
+      } else {

Review comment:
       My plan was to revisit this once the `strptime()` kernel supports returning NAs and then `coalesce()` through the `tryFormats` `base::as.Date()` defaults. In the meantime, defaulting to `"%Y-%m-%d"` sounds like a good plan. 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] dragosmg commented on a change in pull request #12433: ARROW-14808 [R] Implement bindings for `lubridate::date()`

Posted by GitBox <gi...@apache.org>.
dragosmg commented on a change in pull request #12433:
URL: https://github.com/apache/arrow/pull/12433#discussion_r817601442



##########
File path: r/tests/testthat/test-dplyr-funcs-type.R
##########
@@ -768,3 +769,138 @@ test_that("nested structs can be created from scalars and existing data frames",
     tibble(a = 1:2)
   )
 })
+
+test_that("as.Date() converts successfully from date, timestamp, integer, char and double", {
+  test_df <- tibble::tibble(
+    posixct_var = as.POSIXct("2022-02-25 00:00:01", tz = "Europe/London"),
+    date_var = as.Date("2022-02-25"),
+    character_ymd_var = "2022-02-25 00:00:01",
+    character_ydm_var = "2022/25/02 00:00:01",
+    integer_var = 32L,
+    double_var = 34.56
+  )
+
+  # casting from POSIXct treated separately so we can skip on Windows
+  # TODO move the test for casting from POSIXct below once ARROW-13168 is done
+  compare_dplyr_binding(
+    .input %>%
+      mutate(
+        date_dv = as.Date(date_var),
+        date_char_ymd = as.Date(character_ymd_var, format = "%Y-%m-%d %H:%M:%S"),
+        date_char_ydm = as.Date(character_ydm_var, format = "%Y/%d/%m %H:%M:%S"),
+        date_int = as.Date(integer_var, origin = "1970-01-01")
+      ) %>%
+      collect(),
+    test_df
+  )
+
+  # the way we go about it is a bit different, but the result is the same =>
+  # testing without compare_dplyr_binding()
+  expect_equal(
+    test_df %>%
+      arrow_table() %>%
+      mutate(date_double = as.Date(double_var)) %>%
+      collect(),
+    test_df %>%
+      arrow_table() %>%
+      mutate(date_double = as.Date(double_var, origin = "1970-01-01")) %>%
+      collect()
+  )

Review comment:
       Maybe they're remnants from one of the iterations. Removed. 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] github-actions[bot] commented on pull request #12433: ARROW-14808 [R] Implement bindings for `lubridate::date()`

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #12433:
URL: https://github.com/apache/arrow/pull/12433#issuecomment-1040543051


   https://issues.apache.org/jira/browse/ARROW-14808


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] github-actions[bot] commented on pull request #12433: ARROW-14808 [R] Implement bindings for `lubridate::date()`

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #12433:
URL: https://github.com/apache/arrow/pull/12433#issuecomment-1040543051


   https://issues.apache.org/jira/browse/ARROW-14808


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jonkeane commented on a change in pull request #12433: ARROW-14808 [R] Implement bindings for `lubridate::date()`

Posted by GitBox <gi...@apache.org>.
jonkeane commented on a change in pull request #12433:
URL: https://github.com/apache/arrow/pull/12433#discussion_r813328095



##########
File path: r/R/dplyr-funcs-datetime.R
##########
@@ -164,4 +164,10 @@ register_bindings_datetime <- function() {
       return(semester)
     }
   })
+  register_binding("date", function(x) {
+    if (inherits(x, "POSIXct")) {
+      x <- vec_to_Array(x, date32())
+    }
+    x$cast(date32())

Review comment:
       This implementation made me think of the various `as.*` methods we have defined [1] (since this is similar to `as.Date()`). Which all use a simpler setup to create a cast operation. However, I noticed that for those, they are using `Expression$create(...)` rather than the `build_expr(...)` helper [2]. That `build_expr(...)` here _should_ handle the wrapping of R objects into Scalars that you're doing in this first bit here. Can you try that out and confirm that it does what we are looking for with inputs that are both columns in an arrow table/recordbatch/dataset etc. as well as if it's a single value from R (i.e. `mutate(term_start = date("2020-01-01")`). 
   
   Separately: We should also open a jira if one doesn't exist to clean up those `as.*` methods to use `build_expr()`
   
   
   [1] https://github.com/apache/arrow/blob/a916e6015d7cab397d3b769d760b943a60ea3d97/r/R/dplyr-funcs-type.R#L45-L78
   [2] https://github.com/apache/arrow/blob/a916e6015d7cab397d3b769d760b943a60ea3d97/r/R/expression.R#L195-L259




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] dragosmg commented on a change in pull request #12433: ARROW-14808 [R] Implement bindings for `lubridate::date()`

Posted by GitBox <gi...@apache.org>.
dragosmg commented on a change in pull request #12433:
URL: https://github.com/apache/arrow/pull/12433#discussion_r813710773



##########
File path: r/R/dplyr-funcs-datetime.R
##########
@@ -164,4 +164,10 @@ register_bindings_datetime <- function() {
       return(semester)
     }
   })
+  register_binding("date", function(x) {
+    if (inherits(x, "POSIXct")) {
+      x <- vec_to_Array(x, date32())
+    }
+    x$cast(date32())

Review comment:
       [ARROW-15775](https://issues.apache.org/jira/browse/ARROW-15775) for cleaning up `as.*` methods




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] dragosmg commented on a change in pull request #12433: ARROW-14808 [R] Implement bindings for `lubridate::date()`

Posted by GitBox <gi...@apache.org>.
dragosmg commented on a change in pull request #12433:
URL: https://github.com/apache/arrow/pull/12433#discussion_r814632260



##########
File path: r/R/dplyr-funcs-type.R
##########
@@ -76,6 +76,13 @@ register_bindings_type_cast <- function() {
   register_binding("as.numeric", function(x) {
     Expression$create("cast", x, options = cast_options(to_type = float64()))
   })
+  register_binding("as.Date", function(x) {
+    # base::as.Date() first converts to UTC and then extracts the date, which is
+    # why we need to go through timestamp() first - see unit tests for the real
+    # life impact of the difference between lubridate::date() and base::as.Date()
+    y <- build_expr("cast", x, options = cast_options(to_type = timestamp()))

Review comment:
       You are right, that was done to match the default behaviour of `as.Date()`, but we actually need to support a `tz` argument.  
   
   I think this needs a bit more work to support _all of_ (?) the following:
   ``` r
   a <- as.POSIXct("2022-02-25 00:00:01", tz = "Europe/London")
   b <- "2022-02-25 00:00:01"
   c <- "2022/25/02 00:00:01"
   d <- 1234
   as.Date(a)
   #> [1] "2022-02-25"
   as.Date(a, tz = "Pacific/Marquesas")
   #> [1] "2022-02-24"
   as.Date(b)
   #> [1] "2022-02-25"
   as.Date(c, format = "%Y/%d/%m %H:%M:%S")
   #> [1] "2022-02-25"
   as.Date(d, origin = "1981-01-01")
   #> [1] "1984-05-19"
   ```
   
   <sup>Created on 2022-02-25 by the [reprex package](https://reprex.tidyverse.org) (v2.0.1)</sup> 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jonkeane commented on a change in pull request #12433: ARROW-14808 [R] Implement bindings for `lubridate::date()`

Posted by GitBox <gi...@apache.org>.
jonkeane commented on a change in pull request #12433:
URL: https://github.com/apache/arrow/pull/12433#discussion_r814346290



##########
File path: r/R/dplyr-funcs-type.R
##########
@@ -76,6 +76,13 @@ register_bindings_type_cast <- function() {
   register_binding("as.numeric", function(x) {
     Expression$create("cast", x, options = cast_options(to_type = float64()))
   })
+  register_binding("as.Date", function(x) {
+    # base::as.Date() first converts to UTC and then extracts the date, which is
+    # why we need to go through timestamp() first - see unit tests for the real
+    # life impact of the difference between lubridate::date() and base::as.Date()
+    y <- build_expr("cast", x, options = cast_options(to_type = timestamp()))

Review comment:
       I assume that this line is depending on the default timezone for `timestamp()` to be UTC, yeah? Maybe we should be explicit about that here since to make it a little bit easier to read | reason about? 

##########
File path: r/tests/testthat/test-dplyr-funcs-datetime.R
##########
@@ -801,3 +801,85 @@ test_that("dst extracts daylight savings time correctly", {
     test_df
   )
 })
+
+test_that("date works in arrow", {
+  # https://issues.apache.org/jira/browse/ARROW-13168
+  skip_on_os("windows")
+  # this date is specific since lubridate::date() is different from base::as.Date()
+  # since as.Date returns the UTC date and date() doesn't
+  test_df <- tibble(
+    a = as.POSIXct(c("2012-03-26 23:12:13", NA), tz = "America/New_York"))
+
+  # we can't (for now) use namespacing, so we need to make sure lubridate::date()
+  # and not base::date() is being used
+  # TODO remove once https://issues.apache.org/jira/browse/ARROW-14575 is done
+  date <- lubridate::date
+  compare_dplyr_binding(
+    .input %>%
+      mutate(a_date = date(a)) %>%
+      collect(),
+    test_df
+  )
+
+  compare_dplyr_binding(
+    .input %>%
+      mutate(a_date_base = as.Date(a)) %>%
+      collect(),
+    test_df
+  )
+
+  r_date_object <- lubridate::ymd_hms("2012-03-26 23:12:13")
+  compare_dplyr_binding(
+    .input %>%
+      mutate(b = date(r_date_object)) %>%
+      collect(),
+    test_df
+  )
+
+  compare_dplyr_binding(
+    .input %>%
+      mutate(b_base = as.Date(r_date_object)) %>%
+      collect(),
+    test_df
+  )
+
+  skip("All these will fail as we're not actually forcing evaluation")
+  # a timestamp is cast correctly to date
+  expect_equal(
+    call_binding("date", Array$create(as.POSIXct("2022-02-21"))),
+    Array$create(as.POSIXct("2022-02-21"), type = date32())
+  )
+
+  # date() supports R objects
+  expect_equal(
+    call_binding("date", as.POSIXct("2022-02-21")),
+    Array$create(as.POSIXct("2022-02-21"), type = date32())
+  )
+})
+
+test_that("date() errors with unsupported inputs", {
+  skip("All these will fail as we're not actually forcing evaluation")
+  expect_error(
+    call_binding("date", Scalar$create("a string")),
+    "NotImplemented: Unsupported cast from string to date32 using function cast_date32"
+  )
+
+  expect_error(
+    call_binding("date", Scalar$create(32.2)),
+    "NotImplemented: Unsupported cast from double to date32 using function cast_date32"
+  )
+
+  expect_error(
+    arrow_eval(call_binding("date", Scalar$create(TRUE)), mask = arrow_mask(list())),
+    "NotImplemented: Unsupported cast from bool to date32 using function cast_date32"
+  )
+
+  # if we are aiming for equivalent behaviour to lubridate this should fail, but
+  # it doesn't as it is supported in arrow, where integer casting to date returns
+  # the date x days away from epoch
+  skip("supported in arrow, but not in lubridate")
+  expect_error(
+    call_binding("date", Scalar$create(32L)),
+    "NotImplemented: Unsupported cast from integer to date32 using function cast_date32"
+  )

Review comment:
       I would say that we should test this, even if we're diverging from dplyr. If it's easier to do this in a dplyr pipeline compared with evaluating the binding directly, feel free to do that.
   
   Interestingly(?) `lubridate::date()` doesn't outright reject numerics (though one needs to supply an origin in R, which does error). I wonder if there is an issue | if we should raise one with lubridate to either support that, or error with something more helpful. Though if lubridate did support this via POSIXlt, instead of being one day after the epoch, it is actually one second after the epoch(!):
   
   ```
   > as.POSIXlt(1, origin = as.Date("1970-01-01"), tz = "UTC")
   [1] "1970-01-01 00:00:01 UTC"
   ``` 
   
   Anyway TL;DR, having arrow support `date(1)` being one day after the epoch I think is totally fine, but we should make a test for that here that confirms that we expect that behavior.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] dragosmg commented on a change in pull request #12433: ARROW-14808 [R] Implement bindings for `lubridate::date()`

Posted by GitBox <gi...@apache.org>.
dragosmg commented on a change in pull request #12433:
URL: https://github.com/apache/arrow/pull/12433#discussion_r814751986



##########
File path: r/R/dplyr-funcs-type.R
##########
@@ -76,6 +76,13 @@ register_bindings_type_cast <- function() {
   register_binding("as.numeric", function(x) {
     Expression$create("cast", x, options = cast_options(to_type = float64()))
   })
+  register_binding("as.Date", function(x) {
+    # base::as.Date() first converts to UTC and then extracts the date, which is
+    # why we need to go through timestamp() first - see unit tests for the real
+    # life impact of the difference between lubridate::date() and base::as.Date()
+    y <- build_expr("cast", x, options = cast_options(to_type = timestamp()))

Review comment:
       The though question here is how can we support date extraction / casting of a date with a different timezone, since arrow doesn't really support that at the moment (if I'm not mistaken). In the example above:
   ```r
   as.Date(a, tz = "Pacific/Marquesas") 
   #> [1] "2022-02-24"
   ```




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] dragosmg commented on a change in pull request #12433: ARROW-14808 [R] Implement bindings for `lubridate::date()`

Posted by GitBox <gi...@apache.org>.
dragosmg commented on a change in pull request #12433:
URL: https://github.com/apache/arrow/pull/12433#discussion_r816633840



##########
File path: r/R/dplyr-funcs-type.R
##########
@@ -76,6 +76,64 @@ register_bindings_type_cast <- function() {
   register_binding("as.numeric", function(x) {
     Expression$create("cast", x, options = cast_options(to_type = float64()))
   })
+  register_binding("as.Date", function(x,
+                                       format = NULL,
+                                       tryFormats = "%Y-%m-%d",
+                                       origin = "1970-01-01",
+                                       tz = "UTC") {
+
+    if (call_binding("is.Date", x)) {
+      # base::as.Date() first converts to the desired timezone and then extracts
+      # the date, which is why we need to go through timestamp() first
+      return(x)
+
+    # cast from POSIXct
+    } else if (call_binding("is.POSIXct", x)) {
+      if (tz == "UTC") {
+        interim_x <- build_expr("cast", x, options = cast_options(to_type = timestamp(timezone = tz)))
+      } else {
+        abort("`as.Date()` with a timezone different to 'UTC' is not supported in Arrow")
+      }
+
+    # cast from character
+    } else if (call_binding("is.character", x)) {
+      # this could be improved with tryFormats once strptime returns NA and we
+      # can use coalesce - https://issues.apache.org/jira/browse/ARROW-15659
+      # TODO revisit once https://issues.apache.org/jira/browse/ARROW-15659 is done
+      if (is.null(format)) {
+        if (length(tryFormats) == 1) {
+          format <- tryFormats[1]
+        } else {
+          abort("`as.Date()` with multiple `tryFormats` is not supported in Arrow yet")
+        }
+      }
+      # if x is not an expression (e.g. passed as filter), convert it to one
+      if (!inherits(x, "Expression")) {
+        x <- build_expr("cast", x, options = cast_options(to_type = type(x)))
+      }

Review comment:
       The additional casting step does not impact `date()` (the unit test you copied), only `as.Date()`.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] dragosmg commented on a change in pull request #12433: ARROW-14808 [R] Implement bindings for `lubridate::date()`

Posted by GitBox <gi...@apache.org>.
dragosmg commented on a change in pull request #12433:
URL: https://github.com/apache/arrow/pull/12433#discussion_r816684004



##########
File path: r/R/dplyr-funcs-type.R
##########
@@ -76,6 +76,64 @@ register_bindings_type_cast <- function() {
   register_binding("as.numeric", function(x) {
     Expression$create("cast", x, options = cast_options(to_type = float64()))
   })
+  register_binding("as.Date", function(x,
+                                       format = NULL,
+                                       tryFormats = "%Y-%m-%d",
+                                       origin = "1970-01-01",
+                                       tz = "UTC") {
+
+    if (call_binding("is.Date", x)) {
+      # base::as.Date() first converts to the desired timezone and then extracts
+      # the date, which is why we need to go through timestamp() first
+      return(x)
+
+    # cast from POSIXct
+    } else if (call_binding("is.POSIXct", x)) {
+      if (tz == "UTC") {
+        interim_x <- build_expr("cast", x, options = cast_options(to_type = timestamp(timezone = tz)))
+      } else {
+        abort("`as.Date()` with a timezone different to 'UTC' is not supported in Arrow")
+      }
+
+    # cast from character
+    } else if (call_binding("is.character", x)) {
+      # this could be improved with tryFormats once strptime returns NA and we
+      # can use coalesce - https://issues.apache.org/jira/browse/ARROW-15659
+      # TODO revisit once https://issues.apache.org/jira/browse/ARROW-15659 is done
+      if (is.null(format)) {
+        if (length(tryFormats) == 1) {
+          format <- tryFormats[1]
+        } else {
+          abort("`as.Date()` with multiple `tryFormats` is not supported in Arrow yet")
+        }
+      }
+      # if x is not an expression (e.g. passed as filter), convert it to one
+      if (!inherits(x, "Expression")) {
+        x <- build_expr("cast", x, options = cast_options(to_type = type(x)))
+      }
+      interim_x <- call_binding("strptime", x, format, unit = "s")
+
+    # cast from numeric
+    } else if (call_binding("is.numeric", x)) {
+      # the origin argument will be better supported once we implement temporal
+      # arithmetic (https://issues.apache.org/jira/browse/ARROW-14947)
+      # TODO revisit once the above has been sorted
+      if (!call_binding("is.integer", x)) {
+        # Arrow does not support direct casting from double to date so we have
+        # to convert to integers first - casting to int32() would error so we
+        # need to use round before casting

Review comment:
       Done. Switched to using `safe = FALSE` since I think it's a bit cleaner.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jonkeane commented on a change in pull request #12433: ARROW-14808 [R] Implement bindings for `lubridate::date()`

Posted by GitBox <gi...@apache.org>.
jonkeane commented on a change in pull request #12433:
URL: https://github.com/apache/arrow/pull/12433#discussion_r816810668



##########
File path: r/R/dplyr-funcs-type.R
##########
@@ -76,6 +76,64 @@ register_bindings_type_cast <- function() {
   register_binding("as.numeric", function(x) {
     Expression$create("cast", x, options = cast_options(to_type = float64()))
   })
+  register_binding("as.Date", function(x,
+                                       format = NULL,
+                                       tryFormats = "%Y-%m-%d",
+                                       origin = "1970-01-01",
+                                       tz = "UTC") {
+
+    if (call_binding("is.Date", x)) {
+      # base::as.Date() first converts to the desired timezone and then extracts
+      # the date, which is why we need to go through timestamp() first
+      return(x)
+
+    # cast from POSIXct
+    } else if (call_binding("is.POSIXct", x)) {
+      if (tz == "UTC") {
+        interim_x <- build_expr("cast", x, options = cast_options(to_type = timestamp(timezone = tz)))
+      } else {
+        abort("`as.Date()` with a timezone different to 'UTC' is not supported in Arrow")
+      }
+
+    # cast from character
+    } else if (call_binding("is.character", x)) {
+      # this could be improved with tryFormats once strptime returns NA and we
+      # can use coalesce - https://issues.apache.org/jira/browse/ARROW-15659
+      # TODO revisit once https://issues.apache.org/jira/browse/ARROW-15659 is done
+      if (is.null(format)) {
+        if (length(tryFormats) == 1) {
+          format <- tryFormats[1]
+        } else {
+          abort("`as.Date()` with multiple `tryFormats` is not supported in Arrow yet")
+        }
+      }
+      # if x is not an expression (e.g. passed as filter), convert it to one
+      if (!inherits(x, "Expression")) {
+        x <- build_expr("cast", x, options = cast_options(to_type = type(x)))
+      }
+      interim_x <- call_binding("strptime", x, format, unit = "s")
+
+    # cast from numeric
+    } else if (call_binding("is.numeric", x)) {
+      # the origin argument will be better supported once we implement temporal
+      # arithmetic (https://issues.apache.org/jira/browse/ARROW-14947)
+      # TODO revisit once the above has been sorted
+      if (!call_binding("is.integer", x)) {
+        # Arrow does not support direct casting from double to date so we have
+        # to convert to integers first - casting to int32() would error so we
+        # need to use round before casting

Review comment:
       Yeah, `safe` was what I was thinking. Like I said in my comment: I think `floor` is good for here since that is how dates from timestamps work anyway (they round down to the next whole unit, until the next one and don't start rounding up at noon or something).
   
   The unsafe cast rounds towards zero, so for negative numbers you are effectively getting `ceiling`. I suspect in reality no one will ever run into this (and I agree that arrow should support `float -> date`!), but on the off chance that someone does, I think `floor` here is the right way to go (maybe we should also add a comment that explains this reasoning so it's out in the open: We need to go from float -> int, and we want the behavior of `floor` too.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] dragosmg commented on a change in pull request #12433: ARROW-14808 [R] Implement bindings for `lubridate::date()`

Posted by GitBox <gi...@apache.org>.
dragosmg commented on a change in pull request #12433:
URL: https://github.com/apache/arrow/pull/12433#discussion_r817574436



##########
File path: r/R/dplyr-funcs-type.R
##########
@@ -76,6 +76,60 @@ register_bindings_type_cast <- function() {
   register_binding("as.numeric", function(x) {
     Expression$create("cast", x, options = cast_options(to_type = float64()))
   })
+  register_binding("as.Date", function(x,
+                                       format = NULL,
+                                       tryFormats = "%Y-%m-%d",
+                                       origin = "1970-01-01",
+                                       tz = "UTC") {
+
+    if (call_binding("is.Date", x)) {
+      # base::as.Date() first converts to the desired timezone and then extracts
+      # the date, which is why we need to go through timestamp() first
+      return(x)
+
+    # cast from POSIXct
+    } else if (call_binding("is.POSIXct", x)) {
+      if (tz == "UTC") {
+        x <- build_expr("cast", x, options = cast_options(to_type = timestamp(timezone = tz)))
+      } else {
+        abort("`as.Date()` with a timezone different to 'UTC' is not supported in Arrow")
+      }

Review comment:
       Done. Move arg check at the top.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] dragosmg commented on a change in pull request #12433: ARROW-14808 [R] Implement bindings for `lubridate::date()`

Posted by GitBox <gi...@apache.org>.
dragosmg commented on a change in pull request #12433:
URL: https://github.com/apache/arrow/pull/12433#discussion_r810919455



##########
File path: r/tests/testthat/test-dplyr-funcs-datetime.R
##########
@@ -711,3 +711,23 @@ test_that("am/pm mirror lubridate", {
   )
 
 })
+
+test_that("date works in arrow", {
+  # https://issues.apache.org/jira/browse/ARROW-13168
+  skip_on_os("windows")
+  # this date is specific since lubridate::date() is different from base::as.Date()
+  # since as.Date returns the UTC date and date() doesn't
+  test_df <- tibble(
+    a = as.POSIXct(c("2012-03-26 23:12:13", NA), tz = "America/New_York"))
+
+  # we can't (for now) use namespacing, so we need to make sure lubridate::date()
+  # and not base::date() is being used
+  # TODO remove once https://issues.apache.org/jira/browse/ARROW-14575 is done
+  date <- lubridate::date

Review comment:
       As far as I can tell, that is correct. I never had this issue in an interactive session after loading lubridate I don't think we will be 100% sure until we support namespacing with `pkg::function()` syntax. 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] dragosmg commented on a change in pull request #12433: ARROW-14808 [R] Implement bindings for `lubridate::date()`

Posted by GitBox <gi...@apache.org>.
dragosmg commented on a change in pull request #12433:
URL: https://github.com/apache/arrow/pull/12433#discussion_r813721357



##########
File path: r/R/dplyr-funcs-datetime.R
##########
@@ -164,4 +164,10 @@ register_bindings_datetime <- function() {
       return(semester)
     }
   })
+  register_binding("date", function(x) {
+    if (inherits(x, "POSIXct")) {
+      x <- vec_to_Array(x, date32())
+    }
+    x$cast(date32())

Review comment:
       We do not have a separate ticket for `as.Date()`. I will try to tackle it here.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] dragosmg commented on a change in pull request #12433: ARROW-14808 [R] Implement bindings for `lubridate::date()`

Posted by GitBox <gi...@apache.org>.
dragosmg commented on a change in pull request #12433:
URL: https://github.com/apache/arrow/pull/12433#discussion_r813721357



##########
File path: r/R/dplyr-funcs-datetime.R
##########
@@ -164,4 +164,10 @@ register_bindings_datetime <- function() {
       return(semester)
     }
   })
+  register_binding("date", function(x) {
+    if (inherits(x, "POSIXct")) {
+      x <- vec_to_Array(x, date32())
+    }
+    x$cast(date32())

Review comment:
       We do not have a separate ticket for `as.Date()`. I will try to tackle it in this PR.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] dragosmg commented on a change in pull request #12433: ARROW-14808 [R] Implement bindings for `lubridate::date()`

Posted by GitBox <gi...@apache.org>.
dragosmg commented on a change in pull request #12433:
URL: https://github.com/apache/arrow/pull/12433#discussion_r811954177



##########
File path: r/R/dplyr-funcs-datetime.R
##########
@@ -147,5 +147,7 @@ register_bindings_datetime <- function() {
   register_binding("pm", function(x) {
     !call_binding("am", x)
   })
-
+  register_binding("date", function(x) {
+    x$cast(date32())
+  })

Review comment:
       Thanks, @thisisnic for that example. I think that sort of use of R objects is not supported at all in an _arrow_ pipeline (see my _reprex_ below). So neither the binding for `date()`, nor any of the other lubridate ones would work in that context. 
   
   ``` r
   suppressPackageStartupMessages(library(arrow))
   suppressPackageStartupMessages(library(dplyr))
   suppressPackageStartupMessages(library(lubridate))
   
   arbitrary_date_object <- ymd_hms("2012-03-26 23:12:13") + days(1:32)
   # dplyr pipeline
   mtcars %>% 
     mutate(x = arbitrary_date_object) %>% 
     head(4)
   #>                 mpg cyl disp  hp drat    wt  qsec vs am gear carb
   #> Mazda RX4      21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
   #> Mazda RX4 Wag  21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
   #> Datsun 710     22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
   #> Hornet 4 Drive 21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
   #>                                  x
   #> Mazda RX4      2012-03-27 23:12:13
   #> Mazda RX4 Wag  2012-03-28 23:12:13
   #> Datsun 710     2012-03-29 23:12:13
   #> Hornet 4 Drive 2012-03-30 23:12:13
   
   # arrow pipeline
   mtcars %>% 
     arrow_table() %>% 
     mutate(x = arbitrary_date_object) %>% 
     collect() %>% 
     head(4)
   #> Warning: In x = arbitrary_date_object, only values of size one are recycled;
   #> pulling data into R
   #>    mpg cyl disp  hp drat    wt  qsec vs am gear carb                   x
   #> 1 21.0   6  160 110 3.90 2.620 16.46  0  1    4    4 2012-03-27 23:12:13
   #> 2 21.0   6  160 110 3.90 2.875 17.02  0  1    4    4 2012-03-28 23:12:13
   #> 3 22.8   4  108  93 3.85 2.320 18.61  1  1    4    1 2012-03-29 23:12:13
   #> 4 21.4   6  258 110 3.08 3.215 19.44  1  0    3    1 2012-03-30 23:12:13
   ```
   
   <sup>Created on 2022-02-22 by the [reprex package](https://reprex.tidyverse.org) (v2.0.1)</sup>




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] dragosmg commented on a change in pull request #12433: ARROW-14808 [R] Implement bindings for `lubridate::date()`

Posted by GitBox <gi...@apache.org>.
dragosmg commented on a change in pull request #12433:
URL: https://github.com/apache/arrow/pull/12433#discussion_r812745417



##########
File path: r/R/dplyr-funcs-datetime.R
##########
@@ -147,5 +147,7 @@ register_bindings_datetime <- function() {
   register_binding("pm", function(x) {
     !call_binding("am", x)
   })
-
+  register_binding("date", function(x) {
+    x$cast(date32())
+  })

Review comment:
       Recycling a vector of length one works. 
   ``` r
   suppressPackageStartupMessages(library(arrow))
   suppressPackageStartupMessages(library(dplyr))
   suppressPackageStartupMessages(library(lubridate))
   
   arbitrary_date_object <- ymd_hms("2012-03-26 23:12:13")
   # dplyr pipeline
   mtcars %>% 
     mutate(x = arbitrary_date_object) %>% 
     head(4)
   #>                 mpg cyl disp  hp drat    wt  qsec vs am gear carb
   #> Mazda RX4      21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
   #> Mazda RX4 Wag  21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
   #> Datsun 710     22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
   #> Hornet 4 Drive 21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
   #>                                  x
   #> Mazda RX4      2012-03-26 23:12:13
   #> Mazda RX4 Wag  2012-03-26 23:12:13
   #> Datsun 710     2012-03-26 23:12:13
   #> Hornet 4 Drive 2012-03-26 23:12:13
   
   # arrow pipeline
   mtcars %>% 
     arrow_table() %>% 
     mutate(x = arbitrary_date_object) %>% 
     collect() %>% 
     head(4)
   #>    mpg cyl disp  hp drat    wt  qsec vs am gear carb                   x
   #> 1 21.0   6  160 110 3.90 2.620 16.46  0  1    4    4 2012-03-26 23:12:13
   #> 2 21.0   6  160 110 3.90 2.875 17.02  0  1    4    4 2012-03-26 23:12:13
   #> 3 22.8   4  108  93 3.85 2.320 18.61  1  1    4    1 2012-03-26 23:12:13
   #> 4 21.4   6  258 110 3.08 3.215 19.44  1  0    3    1 2012-03-26 23:12:13
   ```
   
   <sup>Created on 2022-02-23 by the [reprex package](https://reprex.tidyverse.org) (v2.0.1)</sup>




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] dragosmg commented on a change in pull request #12433: ARROW-14808 [R] Implement bindings for `lubridate::date()`

Posted by GitBox <gi...@apache.org>.
dragosmg commented on a change in pull request #12433:
URL: https://github.com/apache/arrow/pull/12433#discussion_r818594862



##########
File path: r/tests/testthat/test-dplyr-funcs-type.R
##########
@@ -768,3 +769,138 @@ test_that("nested structs can be created from scalars and existing data frames",
     tibble(a = 1:2)
   )
 })
+
+test_that("as.Date() converts successfully from date, timestamp, integer, char and double", {
+  test_df <- tibble::tibble(
+    posixct_var = as.POSIXct("2022-02-25 00:00:01", tz = "Europe/London"),
+    date_var = as.Date("2022-02-25"),
+    character_ymd_var = "2022-02-25 00:00:01",
+    character_ydm_var = "2022/25/02 00:00:01",
+    integer_var = 32L,
+    double_var = 34.56
+  )
+
+  # casting from POSIXct treated separately so we can skip on Windows
+  # TODO move the test for casting from POSIXct below once ARROW-13168 is done
+  compare_dplyr_binding(
+    .input %>%
+      mutate(
+        date_dv = as.Date(date_var),
+        date_char_ymd = as.Date(character_ymd_var, format = "%Y-%m-%d %H:%M:%S"),
+        date_char_ydm = as.Date(character_ydm_var, format = "%Y/%d/%m %H:%M:%S"),
+        date_int = as.Date(integer_var, origin = "1970-01-01")
+      ) %>%
+      collect(),
+    test_df
+  )
+
+  # the way we go about it is a bit different, but the result is the same =>
+  # testing without compare_dplyr_binding()
+  expect_equal(
+    test_df %>%
+      arrow_table() %>%
+      mutate(date_double = as.Date(double_var)) %>%
+      collect(),
+    test_df %>%
+      arrow_table() %>%
+      mutate(date_double = as.Date(double_var, origin = "1970-01-01")) %>%
+      collect()
+  )
+
+  expect_equal(
+    test_df %>%
+      record_batch() %>%
+      mutate(date_double = as.Date(double_var)) %>%
+      collect(),
+    test_df %>%
+      arrow_table() %>%
+      mutate(date_double = as.Date(double_var, origin = "1970-01-01")) %>%
+      collect()
+  )
+
+  # actual and expected differ due to doubles are accounted for (floored in
+  # arrow and rounded to the next decimal in R)
+  expect_error(
+    compare_dplyr_binding(
+      .input %>%
+        mutate(date_double = as.Date(double_var, origin = "1970-01-01")) %>%
+        collect(),
+      test_df
+    )
+  )

Review comment:
       We could go `double*86400*1000 -> int64() -> date64() -> date32(with safe = FALSE)`, but this would be just as bad as `floor()`




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] dragosmg commented on a change in pull request #12433: ARROW-14808 [R] Implement bindings for `lubridate::date()`

Posted by GitBox <gi...@apache.org>.
dragosmg commented on a change in pull request #12433:
URL: https://github.com/apache/arrow/pull/12433#discussion_r818673835



##########
File path: r/tests/testthat/test-dplyr-funcs-type.R
##########
@@ -768,3 +769,138 @@ test_that("nested structs can be created from scalars and existing data frames",
     tibble(a = 1:2)
   )
 })
+
+test_that("as.Date() converts successfully from date, timestamp, integer, char and double", {
+  test_df <- tibble::tibble(
+    posixct_var = as.POSIXct("2022-02-25 00:00:01", tz = "Europe/London"),
+    date_var = as.Date("2022-02-25"),
+    character_ymd_var = "2022-02-25 00:00:01",
+    character_ydm_var = "2022/25/02 00:00:01",
+    integer_var = 32L,
+    double_var = 34.56
+  )
+
+  # casting from POSIXct treated separately so we can skip on Windows
+  # TODO move the test for casting from POSIXct below once ARROW-13168 is done
+  compare_dplyr_binding(
+    .input %>%
+      mutate(
+        date_dv = as.Date(date_var),
+        date_char_ymd = as.Date(character_ymd_var, format = "%Y-%m-%d %H:%M:%S"),
+        date_char_ydm = as.Date(character_ydm_var, format = "%Y/%d/%m %H:%M:%S"),
+        date_int = as.Date(integer_var, origin = "1970-01-01")
+      ) %>%
+      collect(),
+    test_df
+  )
+
+  # the way we go about it is a bit different, but the result is the same =>
+  # testing without compare_dplyr_binding()
+  expect_equal(
+    test_df %>%
+      arrow_table() %>%
+      mutate(date_double = as.Date(double_var)) %>%
+      collect(),
+    test_df %>%
+      arrow_table() %>%
+      mutate(date_double = as.Date(double_var, origin = "1970-01-01")) %>%
+      collect()
+  )
+
+  expect_equal(
+    test_df %>%
+      record_batch() %>%
+      mutate(date_double = as.Date(double_var)) %>%
+      collect(),
+    test_df %>%
+      arrow_table() %>%
+      mutate(date_double = as.Date(double_var, origin = "1970-01-01")) %>%
+      collect()
+  )
+
+  # actual and expected differ due to doubles are accounted for (floored in
+  # arrow and rounded to the next decimal in R)
+  expect_error(
+    compare_dplyr_binding(
+      .input %>%
+        mutate(date_double = as.Date(double_var, origin = "1970-01-01")) %>%
+        collect(),
+      test_df
+    )
+  )

Review comment:
       ``` r
   library(arrow, warn.conflicts = FALSE)
   #> See arrow_info() for available features
   a <- Array$create(34.56)
   (a*86400*1000)$cast(int64())$cast(date64())$cast(date32(), safe = FALSE)
   #> Array
   #> <date32[day]>
   #> [
   #>   1970-02-04
   #> ]
   (a*86400*1000)$cast(int64())$cast(date64())$cast(date32(), safe = FALSE)$cast(int32())
   #> Array
   #> <int32>
   #> [
   #>   34
   #> ]
   ```
   
   <sup>Created on 2022-03-03 by the [reprex package](https://reprex.tidyverse.org) (v2.0.1)</sup>




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] dragosmg commented on a change in pull request #12433: ARROW-14808 [R] Implement bindings for `lubridate::date()`

Posted by GitBox <gi...@apache.org>.
dragosmg commented on a change in pull request #12433:
URL: https://github.com/apache/arrow/pull/12433#discussion_r814751986



##########
File path: r/R/dplyr-funcs-type.R
##########
@@ -76,6 +76,13 @@ register_bindings_type_cast <- function() {
   register_binding("as.numeric", function(x) {
     Expression$create("cast", x, options = cast_options(to_type = float64()))
   })
+  register_binding("as.Date", function(x) {
+    # base::as.Date() first converts to UTC and then extracts the date, which is
+    # why we need to go through timestamp() first - see unit tests for the real
+    # life impact of the difference between lubridate::date() and base::as.Date()
+    y <- build_expr("cast", x, options = cast_options(to_type = timestamp()))

Review comment:
       The though question here is how can we support date extraction / casting of a date with a different timezone, since arrow doesn't really support that at the moment (if I'm not mistaken). In the example above:
   ```r
   as.Date(a, tz = "Pacific/Marquesas") #> [1] "2022-02-24"
   ```




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] dragosmg commented on a change in pull request #12433: ARROW-14808 [R] Implement bindings for `lubridate::date()`

Posted by GitBox <gi...@apache.org>.
dragosmg commented on a change in pull request #12433:
URL: https://github.com/apache/arrow/pull/12433#discussion_r814876396



##########
File path: r/tests/testthat/test-dplyr-funcs-datetime.R
##########
@@ -801,3 +801,85 @@ test_that("dst extracts daylight savings time correctly", {
     test_df
   )
 })
+
+test_that("date works in arrow", {
+  # https://issues.apache.org/jira/browse/ARROW-13168
+  skip_on_os("windows")
+  # this date is specific since lubridate::date() is different from base::as.Date()
+  # since as.Date returns the UTC date and date() doesn't
+  test_df <- tibble(
+    a = as.POSIXct(c("2012-03-26 23:12:13", NA), tz = "America/New_York"))
+
+  # we can't (for now) use namespacing, so we need to make sure lubridate::date()
+  # and not base::date() is being used
+  # TODO remove once https://issues.apache.org/jira/browse/ARROW-14575 is done
+  date <- lubridate::date
+  compare_dplyr_binding(
+    .input %>%
+      mutate(a_date = date(a)) %>%
+      collect(),
+    test_df
+  )
+
+  compare_dplyr_binding(
+    .input %>%
+      mutate(a_date_base = as.Date(a)) %>%
+      collect(),
+    test_df
+  )
+
+  r_date_object <- lubridate::ymd_hms("2012-03-26 23:12:13")
+  compare_dplyr_binding(
+    .input %>%
+      mutate(b = date(r_date_object)) %>%
+      collect(),
+    test_df
+  )
+
+  compare_dplyr_binding(
+    .input %>%
+      mutate(b_base = as.Date(r_date_object)) %>%
+      collect(),
+    test_df
+  )
+
+  skip("All these will fail as we're not actually forcing evaluation")
+  # a timestamp is cast correctly to date
+  expect_equal(
+    call_binding("date", Array$create(as.POSIXct("2022-02-21"))),
+    Array$create(as.POSIXct("2022-02-21"), type = date32())
+  )
+
+  # date() supports R objects
+  expect_equal(
+    call_binding("date", as.POSIXct("2022-02-21")),
+    Array$create(as.POSIXct("2022-02-21"), type = date32())
+  )
+})
+
+test_that("date() errors with unsupported inputs", {
+  skip("All these will fail as we're not actually forcing evaluation")
+  expect_error(
+    call_binding("date", Scalar$create("a string")),
+    "NotImplemented: Unsupported cast from string to date32 using function cast_date32"
+  )
+
+  expect_error(
+    call_binding("date", Scalar$create(32.2)),
+    "NotImplemented: Unsupported cast from double to date32 using function cast_date32"
+  )
+
+  expect_error(
+    arrow_eval(call_binding("date", Scalar$create(TRUE)), mask = arrow_mask(list())),
+    "NotImplemented: Unsupported cast from bool to date32 using function cast_date32"
+  )
+
+  # if we are aiming for equivalent behaviour to lubridate this should fail, but
+  # it doesn't as it is supported in arrow, where integer casting to date returns
+  # the date x days away from epoch
+  skip("supported in arrow, but not in lubridate")
+  expect_error(
+    call_binding("date", Scalar$create(32L)),
+    "NotImplemented: Unsupported cast from integer to date32 using function cast_date32"
+  )

Review comment:
       Arrow behaviour is aligned with `base::as.Date()` - which still needs an `origin`, while arrow only supports an implicit origin of `1970-01-01`. I think we can work with that - see how I implemented it in the `as.Date()` binding.
   ``` r
   as.Date(32L, origin = "1970-01-01")
   #> [1] "1970-02-02"
   ```
   
   <sup>Created on 2022-02-25 by the [reprex package](https://reprex.tidyverse.org) (v2.0.1)</sup>
   
   But **yes**, we do diverge from lubridate.
   




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] dragosmg commented on a change in pull request #12433: ARROW-14808 [R] Implement bindings for `lubridate::date()`

Posted by GitBox <gi...@apache.org>.
dragosmg commented on a change in pull request #12433:
URL: https://github.com/apache/arrow/pull/12433#discussion_r817573838



##########
File path: r/R/dplyr-funcs-type.R
##########
@@ -76,6 +76,60 @@ register_bindings_type_cast <- function() {
   register_binding("as.numeric", function(x) {
     Expression$create("cast", x, options = cast_options(to_type = float64()))
   })
+  register_binding("as.Date", function(x,
+                                       format = NULL,
+                                       tryFormats = "%Y-%m-%d",
+                                       origin = "1970-01-01",
+                                       tz = "UTC") {
+
+    if (call_binding("is.Date", x)) {
+      # base::as.Date() first converts to the desired timezone and then extracts
+      # the date, which is why we need to go through timestamp() first
+      return(x)
+
+    # cast from POSIXct
+    } else if (call_binding("is.POSIXct", x)) {
+      if (tz == "UTC") {
+        x <- build_expr("cast", x, options = cast_options(to_type = timestamp(timezone = tz)))
+      } else {
+        abort("`as.Date()` with a timezone different to 'UTC' is not supported in Arrow")
+      }
+
+    # cast from character
+    } else if (call_binding("is.character", x)) {
+      # this could be improved with tryFormats once strptime returns NA and we
+      # can use coalesce - https://issues.apache.org/jira/browse/ARROW-15659
+      # TODO revisit once https://issues.apache.org/jira/browse/ARROW-15659 is done
+      if (is.null(format)) {
+        if (length(tryFormats) == 1) {
+          format <- tryFormats[1]
+        } else {
+          abort("`as.Date()` with multiple `tryFormats` is not supported in Arrow yet")
+        }
+      }
+      x <- build_expr("strptime", x, options = list(format = format, unit = 0L))
+
+    # cast from numeric
+    } else if (call_binding("is.numeric", x)) {
+      # the origin argument will be better supported once we implement temporal
+      # arithmetic (https://issues.apache.org/jira/browse/ARROW-14947)
+      # TODO revisit once the above has been sorted
+      if (!call_binding("is.integer", x)) {

Review comment:
       Done




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] dragosmg commented on a change in pull request #12433: ARROW-14808 [R] Implement bindings for `lubridate::date()`

Posted by GitBox <gi...@apache.org>.
dragosmg commented on a change in pull request #12433:
URL: https://github.com/apache/arrow/pull/12433#discussion_r816591366



##########
File path: r/tests/testthat/test-dplyr-funcs-datetime.R
##########
@@ -801,3 +801,94 @@ test_that("dst extracts daylight savings time correctly", {
     test_df
   )
 })
+
+test_that("date works in arrow", {
+  # https://issues.apache.org/jira/browse/ARROW-13168
+  skip_on_os("windows")
+  # this date is specific since lubridate::date() is different from base::as.Date()
+  # since as.Date returns the UTC date and date() doesn't
+  test_df <- tibble(
+    a = as.POSIXct(c("2012-03-26 23:12:13", NA), tz = "America/New_York"))

Review comment:
       Done.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] dragosmg commented on a change in pull request #12433: ARROW-14808 [R] Implement bindings for `lubridate::date()`

Posted by GitBox <gi...@apache.org>.
dragosmg commented on a change in pull request #12433:
URL: https://github.com/apache/arrow/pull/12433#discussion_r816672986



##########
File path: r/R/dplyr-funcs-type.R
##########
@@ -76,6 +76,64 @@ register_bindings_type_cast <- function() {
   register_binding("as.numeric", function(x) {
     Expression$create("cast", x, options = cast_options(to_type = float64()))
   })
+  register_binding("as.Date", function(x,
+                                       format = NULL,
+                                       tryFormats = "%Y-%m-%d",
+                                       origin = "1970-01-01",
+                                       tz = "UTC") {
+
+    if (call_binding("is.Date", x)) {
+      # base::as.Date() first converts to the desired timezone and then extracts
+      # the date, which is why we need to go through timestamp() first
+      return(x)
+
+    # cast from POSIXct
+    } else if (call_binding("is.POSIXct", x)) {
+      if (tz == "UTC") {
+        interim_x <- build_expr("cast", x, options = cast_options(to_type = timestamp(timezone = tz)))
+      } else {
+        abort("`as.Date()` with a timezone different to 'UTC' is not supported in Arrow")
+      }
+
+    # cast from character
+    } else if (call_binding("is.character", x)) {
+      # this could be improved with tryFormats once strptime returns NA and we
+      # can use coalesce - https://issues.apache.org/jira/browse/ARROW-15659
+      # TODO revisit once https://issues.apache.org/jira/browse/ARROW-15659 is done
+      if (is.null(format)) {
+        if (length(tryFormats) == 1) {
+          format <- tryFormats[1]
+        } else {
+          abort("`as.Date()` with multiple `tryFormats` is not supported in Arrow yet")
+        }
+      }
+      # if x is not an expression (e.g. passed as filter), convert it to one
+      if (!inherits(x, "Expression")) {
+        x <- build_expr("cast", x, options = cast_options(to_type = type(x)))
+      }
+      interim_x <- call_binding("strptime", x, format, unit = "s")
+
+    # cast from numeric
+    } else if (call_binding("is.numeric", x)) {
+      # the origin argument will be better supported once we implement temporal
+      # arithmetic (https://issues.apache.org/jira/browse/ARROW-14947)
+      # TODO revisit once the above has been sorted
+      if (!call_binding("is.integer", x)) {
+        # Arrow does not support direct casting from double to date so we have
+        # to convert to integers first - casting to int32() would error so we
+        # need to use round before casting

Review comment:
       Is `safe = FALSE` what you had in mind?
   ``` r
   suppressPackageStartupMessages(library(arrow))
   a <- Array$create(32.45)
   a$cast(int32())
   #> Error: Invalid: Float value 32.45 was truncated converting to int32
   #> /Users/dragos/Documents/arrow/cpp/src/arrow/compute/kernels/scalar_cast_numeric.cc:177  CheckFloatToIntTruncation(batch[0], *out)
   #> /Users/dragos/Documents/arrow/cpp/src/arrow/compute/exec.cc:700  kernel_->exec(kernel_ctx_, batch, &out)
   #> /Users/dragos/Documents/arrow/cpp/src/arrow/compute/exec.cc:641  ExecuteBatch(batch, listener)
   #> /Users/dragos/Documents/arrow/cpp/src/arrow/compute/function.cc:255  executor->Execute(implicitly_cast_args, &listener)
   
   a$cast(int32(), safe = FALSE)
   #> Array
   #> <int32>
   #> [
   #>   32
   #> ]
   ```
   
   <sup>Created on 2022-03-01 by the [reprex package](https://reprex.tidyverse.org) (v2.0.1)</sup>
   
   Ideally we would support float -> date casting ([ARROW-15807](https://issues.apache.org/jira/browse/ARROW-15807)) 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org