Posted to jira@arrow.apache.org by "Carl Boettiger (Jira)" <ji...@apache.org> on 2022/10/01 00:34:00 UTC

[jira] [Commented] (ARROW-17905) [R] as_date and similar methods fail with digit seconds

    [ https://issues.apache.org/jira/browse/ARROW-17905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17611796#comment-17611796 ] 

Carl Boettiger commented on ARROW-17905:
----------------------------------------

Very cool.  I actually tried to do that but couldn't figure out the syntax for dplyr.  It looks like I should even be able to pass dplyr the literal cast() command and modify the units or truncation options, but I couldn't get that to work, nor any of the functions listed in `list_compute_functions()`.  For example, I tried:

 
{code:java}
open_dataset(f) |> mutate(t = arrow_ascii_ltrim(t,10)) |> collect()
{code}
I did get it to work using substr(),


{code:java}
open_dataset(f) |> mutate(t = substr(t,1,10)) |> collect() {code}
which kind of surprised me, because substr() isn't listed in list_compute_functions(), and most other base R or tidyverse functions that trim strings failed (e.g. strtrim() isn't recognized, nor is stringr::str_trim()).
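
Putting the substr() workaround together with as_date() might look like the sketch below. Only the substr() step is confirmed above; the rest is untested and assumes as_date() can parse the truncated "YYYY-MM-DD" string with the "%Y-%m-%d" format shown in the expression from the original report. The names `result` and `d` are just illustrative.

```r
library(arrow)
library(dplyr)
library(lubridate)

# Same setup as the reprex in the issue: partitioning on a timestamp
# column means it is read back as a string with fractional seconds.
f <- tempfile()
data.frame(t = Sys.time(), A = 1) |>
  write_dataset(f, partitioning = "t")

result <- open_dataset(f) |>
  mutate(t = substr(t, 1, 10)) |>  # keep only the "YYYY-MM-DD" part
  mutate(d = as_date(t)) |>        # assumption: parses once the time part is gone
  collect()
```

This drops the time of day entirely, so it only helps when a date (not a full timestamp) is what is needed.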

Is there a list of the R functions, like substr() and as_date(), that arrow understands?

(Also would be great to have more examples of using the compute functions with dplyr)

Anyway thanks! I'll keep an eye on the issue you mentioned.  We can close this one out.  

 

> [R] as_date and similar methods fail with digit seconds
> -------------------------------------------------------
>
>                 Key: ARROW-17905
>                 URL: https://issues.apache.org/jira/browse/ARROW-17905
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: R
>    Affects Versions: 9.0.0
>            Reporter: Carl Boettiger
>            Priority: Major
>
> The Arrow 9.0 R client introduced support for dates with lubridate (and base R as.Date()) functions, which is awesome.
> However, these functions fail to handle timestamps with fractional (decimal) seconds.  This will especially confuse R users because the native R functions work as expected, and R users will not be aware of the metaprogramming translation happening underneath.  It is easiest to see this in a minimal reprex:
> {code:java}
> library(arrow); library(lubridate); library(dplyr){code}
> {code:java}
> f <- tempfile()
> data.frame(t = Sys.time(), A = 1) |>
>   write_dataset(f, partitioning = "t")
> # ERRORS
> open_dataset(f) |> mutate(as_date(t)) |> collect() {code}
> This errors with message:
> {code:java}
> open_dataset(f) |> mutate(as_date(t)) |> collect()
> Error in `collect()`:
> ! Invalid: Failed to parse string: '2022-09-30 22:03:32.123248' as a scalar of type timestamp[s] {code}
> Which is strange because lubridate::as_date('2022-09-30 22:03:32.123248') works fine.  
> It's easy to see the cause of the error prior to collect:
> {code:java}
> as_date(t): date32[day] (cast(strptime(t, {format="%Y-%m-%d", unit=SECOND, error_is_null=false}), {to_type=date32[day], allow_int_overflow=false, allow_time_truncate=false, allow_time_overflow=false, allow_decimal_truncate=false, allow_float_truncate=false, allow_invalid_utf8=false})){code}
> We can see a lot of assumptions there about units of parsing, but as far as I know we have no way to control them from R.  The issue is particularly ironic because, as you can see in my example, the column has only become a string because we used it as a partition.  So arrow originally coerced the timestamp to a string (using microsecond precision, which is an understandable choice because it is lossless, though it differs from R's as.character() behavior).  But now arrow doesn't understand how to reverse its own timestamp->string behavior to get back to a timestamp!
> Ideally the user would have more control of these, and the default assumptions would be consistent.  Ideally, as_datetime, as_date, etc should not choke regardless of the precision of the seconds, matching the existing behavior of the base R (as.Date etc) and lubridate functions. 


