Posted to issues@arrow.apache.org by "Bai Ming (Jira)" <ji...@apache.org> on 2019/11/04 13:05:00 UTC

[jira] [Commented] (ARROW-7028) [R] Date roundtrip results in different R storage mode

    [ https://issues.apache.org/jira/browse/ARROW-7028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16966644#comment-16966644 ] 

Bai Ming commented on ARROW-7028:
---------------------------------

Hi, I've recently tried out the arrow R package and have run into a similar issue, so I just wanted to flag it (otherwise it's been working great for me - thanks!).

The underlying cause is the same, but I'm working with the data.table package in R, where the change in internal representation causes date arithmetic that reassigns to the same column to fail (see code example below). I'm not sure whether arrow was meant to work with data.table, but I think it would be helpful to get back the same type of date column after reading.

In my case, I tried to convert my "integer" Date to a "double" Date after reading (by converting the column to character and then to Date again), but this was slow because the data was quite large (or perhaps I didn't do it optimally). Eventually I settled on a workaround: writing the result to a new column instead of reassigning to the same one.
{code:java}
library(dplyr)
library(data.table)

tmp = tempdir()
dat = tibble(tag = as.Date("2018-01-01"), group = "A")
dat2 = tibble(tag2 = as.Date("2019-01-01"), group = "A")

arrow::write_parquet(dat, file.path(tmp, "dat.parquet"))
dat = arrow::read_parquet(file.path(tmp, "dat.parquet"))

dt <- as.data.table(dat) # Convert to data.table

# Date operation that adds one day and reassigns to the same column.
# This line gives the error shown below.
dt[, tag := tag + 1, group]

# Error in `[.data.table`(dt, , `:=`(tag, tag + 1), group) : 
#   Type of RHS ('double') must match LHS ('integer'). To check 
#   and coerce would impact performance too much for the fastest cases.
#   Either change the type of the target column, or coerce the RHS of :=
#   yourself (e.g. by using 1L instead of 1)
{code}
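
A possibly cheaper alternative to the character round trip mentioned above is to change only the storage mode of the Date vector. This is a minimal base R sketch, not checked against arrow or data.table: it assumes the column is an ordinary Date vector whose only difference is integer storage, and it simulates such a vector instead of reading one from Parquet. The variable name `x` is just for illustration.

```r
# Simulate a Date that is stored as integer, like the one described
# above after the Parquet round trip (assumption: class "Date" is intact
# and only the underlying storage type differs).
x <- as.Date("2018-01-01")
storage.mode(x) <- "integer"
typeof(x)  # "integer"

# Switch the storage back to double. `storage.mode<-` preserves
# attributes, so the "Date" class is kept and no parsing via
# character is needed.
storage.mode(x) <- "double"
typeof(x)  # "double"

# Date arithmetic now yields the same type as the stored vector.
y <- x + 1
typeof(y)  # "double"
```

Whether this is noticeably faster on a large column would need to be measured; it only avoids formatting and re-parsing every value.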

> [R] Date roundtrip results in different R storage mode
> ------------------------------------------------------
>
>                 Key: ARROW-7028
>                 URL: https://issues.apache.org/jira/browse/ARROW-7028
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: R
>    Affects Versions: 0.15.0
>            Reporter: Sascha
>            Priority: Major
>         Attachments: image-2019-10-30-23-08-17-296.png
>
>
> When saving R-dataframes with parquet and loading them again, the internal representation of Dates changes, leading e.g. to errors when comparing them in dplyr::if_else.
> {code}
> library(dplyr)
> #> 
> #> Attaching package: 'dplyr'
> #> The following objects are masked from 'package:stats':
> #> 
> #> filter, lag
> #> The following objects are masked from 'package:base':
> #> 
> #> intersect, setdiff, setequal, union
> tmp = tempdir()
> dat = tibble(tag = as.Date("2018-01-01"))
> dat2 = tibble(tag2 = as.Date("2019-01-01"))
> arrow::write_parquet(dat, file.path(tmp, "dat.parquet"))
> dat = arrow::read_parquet(file.path(tmp, "dat.parquet"))
> typeof(dat$tag)
> #> [1] "integer"
> typeof(dat2$tag2)
> #> [1] "double"
> bind_cols(dat, dat2) %>%
>  mutate(comparison = if_else(TRUE, tag, tag2))
> #> `false` must be a `Date` object, not a `Date` object
> {code}
> Created on 2019-10-30 by the [reprex package](https://reprex.tidyverse.org) (v0.3.0)


