Posted to issues@arrow.apache.org by "Carl Boettiger (Jira)" <ji...@apache.org> on 2022/09/30 22:17:00 UTC
[jira] [Created] (ARROW-17905) [R] as_date and similar methods fail with digit seconds

Carl Boettiger created ARROW-17905:
--------------------------------------

             Summary: [R] as_date and similar methods fail with digit seconds
                 Key: ARROW-17905
                 URL: https://issues.apache.org/jira/browse/ARROW-17905
             Project: Apache Arrow
          Issue Type: Bug
          Components: R
    Affects Versions: 9.0.0
            Reporter: Carl Boettiger


The Arrow 9.0 R client introduced support for handling dates with lubridate functions (and base R's as.Date()), which is awesome.

However, these functions fail on timestamps with fractional (decimal) seconds.  This is especially confusing for R users because the native R functions work as expected, and users may not realize that arrow translates these calls into its own compute expressions behind the scenes.  It is easiest to see in a minimal reprex:



{code:java}
library(arrow); library(lubridate); library(dplyr){code}
{code:java}
f <- tempfile()
data.frame(t = Sys.time(), A = 1) |>
  write_dataset(f, partitioning = "t")

# ERRORS
open_dataset(f) |> mutate(as_date(t)) |> collect() {code}

This errors with message:


{code:java}
open_dataset(f) |> mutate(as_date(t)) |> collect()
Error in `collect()`:
! Invalid: Failed to parse string: '2022-09-30 22:03:32.123248' as a scalar of type timestamp[s] {code}
This is strange, because lubridate::as_date('2022-09-30 22:03:32.123248') works fine in plain R.
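
For comparison, here is a minimal sketch of the plain-R behavior, using the string from the error message. The variable names are only illustrative, and nothing here goes through arrow:

{code:java}
library(lubridate)

# The exact string arrow failed to parse (copied from the error message)
x <- "2022-09-30 22:03:32.123248"

# Both calls are evaluated by R itself, not translated by arrow
d_lubridate <- lubridate::as_date(x)
d_base <- as.Date(x)

print(d_lubridate)
print(d_base)
{code}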

The cause of the error is easy to see by printing the query before calling collect():


{code:java}
as_date(t): date32[day] (cast(strptime(t, {format="%Y-%m-%d", unit=SECOND, error_is_null=false}), {to_type=date32[day], allow_int_overflow=false, allow_time_truncate=false, allow_time_overflow=false, allow_decimal_truncate=false, allow_float_truncate=false, allow_invalid_utf8=false})){code}
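
The expression above comes from printing the lazy query instead of collecting it. A sketch of how to reproduce that, reusing the reprex setup (the name `query` is just for illustration):

{code:java}
library(arrow); library(lubridate); library(dplyr)

f <- tempfile()
data.frame(t = Sys.time(), A = 1) |>
  write_dataset(f, partitioning = "t")

# Without collect(), the query is only built, not evaluated
query <- open_dataset(f) |> mutate(as_date(t))

# Printing shows the translated expression, including the strptime options
print(query)
{code}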

We can see a lot of assumptions there about the format and unit used for parsing (format="%Y-%m-%d", unit=SECOND), but as far as I know we have no way to control them from R.  The issue is particularly ironic because, as you can see in my example, the column only became a string because we used it as a partition.  So arrow coerced the timestamp to a string in the first place (using microsecond precision, which is an understandable choice because it is lossless, though it differs from R's as.character() behavior).  Yet arrow now doesn't know how to reverse its own timestamp->string conversion to get back to a timestamp!

Ideally the user would have more control over these options, and the default assumptions would be consistent.  Functions such as as_datetime and as_date should not choke regardless of the precision of the seconds, matching the existing behavior of the base R (as.Date etc.) and lubridate functions.
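
Until then, one possible workaround (a sketch, not a fix) is to collect first and do the conversion in R, so that lubridate rather than arrow parses the string. This assumes the data fits in memory; the as.character() call is a precaution in case the partition column does not come back as a plain character vector, and the names `result` and `date` are only illustrative:

{code:java}
library(arrow); library(lubridate); library(dplyr)

f <- tempfile()
data.frame(t = Sys.time(), A = 1) |>
  write_dataset(f, partitioning = "t")

# collect() first, so mutate() runs in R on an ordinary data frame
result <- open_dataset(f) |>
  collect() |>
  mutate(date = lubridate::as_date(as.character(t)))

print(result)
{code}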


