Posted to jira@arrow.apache.org by "Weston Pace (Jira)" <ji...@apache.org> on 2021/12/16 01:35:00 UTC
[jira] [Comment Edited] (ARROW-15124) [R] default TZ parsing woes in CSV reader

    [ https://issues.apache.org/jira/browse/ARROW-15124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17460352#comment-17460352 ] 

Weston Pace edited comment on ARROW-15124 at 12/16/21, 1:34 AM:
----------------------------------------------------------------

Are "2021-02-01T00:00:00Z" and "2021-02-01" the same instant? (readr seems to consider these the same instant in time)

Would "2021-02-01T00:00:00-07:00" and "2021-02-01" be the same instant? (readr does not seem to consider these the same instant in time)

In other words, should values without a time zone be blindly assumed to be UTC?  Or should a new CSV reader argument be added (the time zone to be used when one isn't present)?
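
To make the second option concrete, here is a rough base-R sketch of what "a time zone to be used when one isn't present" could mean. The helper name parse_with_default_tz and its default_tz argument are hypothetical; they are not part of arrow or readr, and the sketch only handles the three string shapes discussed in this issue.

```r
# Hypothetical helper (not an arrow or readr API): parse ISO-8601-like
# strings, applying `default_tz` only to values that carry no zone marker.
parse_with_default_tz <- function(x, default_tz = "UTC") {
  secs <- rep(NA_real_, length(x))

  # Classify each string by shape before parsing, because strptime()
  # silently ignores trailing characters that the format does not cover.
  is_date   <- grepl("^[0-9]{4}-[0-9]{2}-[0-9]{2}$", x)
  is_zulu   <- grepl("T[0-9:]+Z$", x)
  is_offset <- grepl("T[0-9:]+[+-][0-9]{2}:[0-9]{2}$", x)

  # Bare dates: midnight in the caller-supplied default zone.
  secs[is_date] <- as.numeric(as.POSIXct(
    strptime(x[is_date], "%Y-%m-%d", tz = default_tz)))

  # Trailing "Z": the value is already UTC.
  secs[is_zulu] <- as.numeric(as.POSIXct(
    strptime(x[is_zulu], "%Y-%m-%dT%H:%M:%SZ", tz = "UTC")))

  # Explicit offsets: drop the colon so "%z" sees "-0700" rather than "-07:00".
  compact <- sub("([+-][0-9]{2}):([0-9]{2})$", "\\1\\2", x[is_offset])
  secs[is_offset] <- as.numeric(as.POSIXct(
    strptime(compact, "%Y-%m-%dT%H:%M:%S%z", tz = "UTC")))

  # Anything unrecognised stays NA; the result is always expressed in UTC.
  .POSIXct(secs, tz = "UTC")
}

x <- c("2021-02-01T00:00:00Z", "2021-02-01", "2021-02-01T00:00:00-07:00")
parse_with_default_tz(x)                                # bare date read as UTC
parse_with_default_tz(x, default_tz = "America/Denver") # bare date read as local midnight
```

With the default, the first two values should land on the same instant (the readr-like behaviour described above); with a non-UTC default_tz, the bare date would instead line up with the matching offset form, assuming the system's time zone database knows that zone.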


was (Author: westonpace):
Are "2021-02-01T00:00:00Z" and "2021-02-01" the same instant? (readr seems to consider these the same instant in time)

Would "2021-02-01T00:00:00-07:00" and "2021-02-01" be the same instant? (readr does not seem to consider these the same instant in time)

In other words should values without a datetime be blindly assumed to be UTC?  Or should a new CSV reader argument be added (the timestamp to be used when one isn't present).

> [R] default TZ parsing woes in CSV reader
> -----------------------------------------
>
>                 Key: ARROW-15124
>                 URL: https://issues.apache.org/jira/browse/ARROW-15124
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: R
>    Affects Versions: 6.0.1
>            Reporter: Carl Boettiger
>            Priority: Major
>
> I am attempting to use open_dataset() on a large collection of CSV files in which a timestamp column is sometimes written as a bare date and sometimes as a datetime with a time zone designator.
> readr reads both forms without complaint when the column type is set to datetime (col_types="T", see below), but arrow::read_csv_arrow() insists that one form must use tz="UTC" in the schema while the other must not, so no single schema is valid for both.  Easiest to see this in a simple example:
> {code:java}
> x <- tempfile()
> df <- data.frame(time = '2021-02-01T00:00:00Z')
> readr::write_csv(df, x)
> schema = arrow::schema(time = arrow::timestamp("s", ""))
> # ERROR: cannot parse without tz="UTC" in the schema:
> arrow::read_csv_arrow(x, schema = schema, skip = 1)
> df2 <- readr::read_csv(x, col_types="T")  # works fine{code}
> {code:java}
> df <- data.frame(time = '2021-02-01')
> readr::write_csv(df, x)
> ## ERROR: cannot parse with tz="UTC" in the schema:
> schema = arrow::schema(time = arrow::timestamp("s", "UTC"))
> arrow::read_csv_arrow(x, schema = schema, skip = 1)
> ## Once again, readr has no issues:
> df2 <- readr::read_csv(x, col_types="T")
>  {code}


