Posted to jira@arrow.apache.org by "Neal Richardson (Jira)" <ji...@apache.org> on 2020/09/25 15:04:00 UTC

[jira] [Commented] (ARROW-10088) [R] Integer64 incorrectly read into R data.table

    [ https://issues.apache.org/jira/browse/ARROW-10088?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17202225#comment-17202225 ] 

Neal Richardson commented on ARROW-10088:
-----------------------------------------

My guess is that the issue is related to the fact that you don't have a true {{integer64}}: the "class" attribute has {{np.ulong}} in there too, so the code that looks to preserve R attributes is coming in after the downcast and adding those classes back on. 
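A minimal sketch of that hypothesis in base R, not the actual arrow code path: the values are first downcast to a plain integer vector, and the saved attributes are re-applied afterwards. The helper name {{restore_attributes}} is hypothetical; the class vector is the one shown in the report.

```r
# Hypothetical stand-in for the step that restores saved R attributes.
restore_attributes <- function(x, saved) {
  attributes(x) <- saved
  x
}

# Attributes recorded for the column (class vector taken from the report).
saved <- list(class = c("integer64", "np.ulong"))

# All int64 values fit in int32, so the reader yields a plain integer vector.
downcast <- c(1L, 2L, 3L)

# Re-applying the saved class afterwards recreates the mismatch:
# integer storage carrying an "integer64" class.
result <- restore_attributes(downcast, saved)

storage <- typeof(result)
classes <- class(result)
```

Here {{storage}} is "integer" while {{classes}} still contains "integer64", which is the combination bit64 and data.table reject.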

On the latest development version, there are some changes in the logic, and this now seems to come out as just "integer". Perhaps try installing a nightly version of the package and see if you see the same?

Even so, this probably isn't ideal. Maybe you want to be able to control whether int64 downcasts to R integer (currently it only does so if all values fit in the int32 range). Maybe you want an explicit "integer64" class from R in the metadata to mean that you don't downcast, for R roundtrip fidelity. Or there may be other possibilities. It's unclear whether they justify the added complexity, but I'm happy to discuss.
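The current downcast rule can be sketched roughly as follows. This is an illustration in base R, not Arrow's implementation: {{maybe_downcast}} is a hypothetical name, and the 64-bit values are stood in for by whole-number doubles, which is only exact below 2^53.

```r
# Hypothetical helper: downcast only when every value fits in R's integer range.
# R reserves INT_MIN for NA_integer_, so the usable range is
# -.Machine$integer.max to .Machine$integer.max.
maybe_downcast <- function(x) {
  limit <- .Machine$integer.max
  fits <- all(abs(x) <= limit, na.rm = TRUE)
  if (fits) as.integer(x) else x
}

small <- maybe_downcast(c(1, 2, 3))   # every value fits: becomes integer
large <- maybe_downcast(c(1, 2^40))   # one value is too large: stays double
```

An opt-out, as floated above, would amount to skipping this check when the caller or the stored metadata asks for 64-bit fidelity.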

> [R] Integer64 incorrectly read into R data.table
> ------------------------------------------------
>
>                 Key: ARROW-10088
>                 URL: https://issues.apache.org/jira/browse/ARROW-10088
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: R
>    Affects Versions: 1.0.1
>            Reporter: Kyle Kavanagh
>            Priority: Major
>
> I've got a proprietary dataset where one of the columns is an integer64 but all of the values would fit within 32bits.  As I understand it, arrow/feather will downcast that column when the data is read back into R (not ideal IMO, but not an issue generally).  However, I'm having some trouble with a specific dataset. 
> When I read in the data, the column is set to the class "integer64"; however, the column type (typeof) is 'integer' and not 'double', which is the underlying type used by bit64.  This mismatch causes R data.table to error out ([https://github.com/Rdatatable/data.table/blob/master/src/rbindlist.c#L325]).
> I do not have any issue with integer64 columns which have values > 2^32, and suspiciously I am also unable to recreate the issue by manually creating a data.table with an int64 column with small values (e.g. data.table(col=as.integer64(c(1,2,3)))).
> I did look through the arrow::r cpp source and couldn't find an obvious case where the underlying storage array would be an integer but also have the 'integer64' class attr assigned...  A fix would either be to remove the integer64 class attr, or ensure that the underlying data store is a REALSXP instead of INTEGERSXP.
> My company's network policies won't let me upload the sample dataset, so I'm hoping this triggers some immediate thoughts.  If not, I can try to figure out how to upload the dataset or otherwise provide details from it as requested.
>  
> {code:java}
> > arrow::write_feather(df[,list(testCol)][1], '~/test.feather')
> > test = arrow::read_feather('~/test.feather')
> > class(test$testCol)
> [1] "integer64" "np.ulong"
> > typeof(test$testCol)
> [1] "integer"
> > str(test)
> Classes ‘tbl_df’, ‘tbl’ and 'data.frame':       1 obs. of  1 variable:
> $ testCol:Error in as.character.integer64(object) :  REAL() can only be applied to a 'numeric', not a 'integer'
> #In the larger original dataset, it handles most columns properly, only the 'testCol' breaks things.  Note the difference:
> > typeof(df$goodCol)
> [1] "double"
> > class(df$goodCol)
> [1] "integer64" "np.ulong"
> > typeof(df$testCol)
> [1] "integer"
> > class(df$testCol)
> [1] "integer64" "np.ulong"
> > str(df)
> Classes ‘data.table’ and 'data.frame':  214781 obs. of  17 variables: 
> $ goodCol        :integer64 1599777000000604025 ... 
> $ testCol        :Error in as.character.integer64(object) :
> > sessionInfo()
> R version 3.6.1 (2019-07-05)
> Platform: x86_64-pc-linux-gnu (64-bit)
> Running under: Red Hat Enterprise Linux Server 7.7 (Maipo)
>
> Matrix products: default
> BLAS:   /usr/lib64/libblas.so.3.4.2
> LAPACK: /usr/lib64/liblapack.so.3.4.2
>
> locale:
>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
>  [9] LC_ADDRESS=C               LC_TELEPHONE=C
> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>
> attached base packages:
> [1] stats     graphics  grDevices utils     datasets  methods   base
>
> other attached packages:
> [1] data.table_1.13.0 bit64_4.0.5       bit_4.0.4
>
> loaded via a namespace (and not attached):
>  [1] Rcpp_1.0.5           lattice_0.20-41      arrow_1.0.1
>  [4] assertthat_0.2.1     rappdirs_0.3.1       grid_3.6.1
>  [7] R6_2.4.1             jsonlite_1.7.1       magrittr_1.5
> [10] rlang_0.4.7          Matrix_1.2-18        vctrs_0.3.4
> [13] reticulate_1.14-9001 tools_3.6.1          glue_1.4.2
> [16] purrr_0.3.4          compiler_3.6.1       tidyselect_1.1.0
> {code}
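
As a stopgap on the R side, the first fix proposed in the report (dropping the class when the storage is integer) could be applied after reading. A minimal sketch, assuming plain data.frame input; {{fix_integer64_cols}} is a hypothetical helper, not part of arrow, bit64, or data.table.

```r
# Hypothetical helper: for columns that claim to be integer64 but are stored
# as plain integers, drop the class attribute so they are ordinary integers.
fix_integer64_cols <- function(df) {
  for (name in names(df)) {
    col <- df[[name]]
    if (inherits(col, "integer64") && typeof(col) == "integer") {
      df[[name]] <- unclass(col)
    }
  }
  df
}

# Recreate the broken column from the report without needing the dataset.
broken <- data.frame(id = 1:3)
broken$testCol <- structure(c(1L, 2L, 3L), class = c("integer64", "np.ulong"))

fixed <- fix_integer64_cols(broken)
```

This discards the extra "np.ulong" class as well. Columns that are genuinely stored as doubles with the integer64 class, such as goodCol above, are left untouched.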


