Posted to jira@arrow.apache.org by "Pal (Jira)" <ji...@apache.org> on 2021/12/02 07:18:00 UTC
[jira] [Closed] (ARROW-14939) [R] Problem with new variables in dataset schema

     [ https://issues.apache.org/jira/browse/ARROW-14939?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Pal closed ARROW-14939.
-----------------------
    Resolution: Resolved

> [R] Problem with new variables in dataset schema
> ------------------------------------------------
>
>                 Key: ARROW-14939
>                 URL: https://issues.apache.org/jira/browse/ARROW-14939
>             Project: Apache Arrow
>          Issue Type: Bug
>    Affects Versions: 6.0.1
>         Environment: RStudio Version
> --------------------------------------------------
> 1.4.1717
> Session Information
> --------------------------------------------------
> R version 4.1.0 (2021-05-18)
> Platform: x86_64-apple-darwin17.0 (64-bit)
> Running under: macOS 12.0.1
> Matrix products: default
> LAPACK: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRlapack.dylib
> locale:
> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
> attached base packages:
> [1] stats     graphics  grDevices utils     datasets  methods   base     
> other attached packages:
> [1] arrow_6.0.1
> loaded via a namespace (and not attached):
>  [1] tidyselect_1.1.1 bit_4.0.4        compiler_4.1.0   magrittr_2.0.1   assertthat_0.2.1 R6_2.5.1        
>  [7] tools_4.1.0      glue_1.5.0       bit64_4.0.5      vctrs_0.3.8      rlang_0.4.12     purrr_0.3.4     
> System Information
> --------------------------------------------------
> sysname        : Darwin                                                                                         
> release        : 21.1.0                                                                                         
> version        : Darwin Kernel Version 21.1.0: Wed Oct 13 17:33:23 PDT 2021; root:xnu-8019.41.5~1/RELEASE_X86_64
> nodename       :                                                                    
> machine        : x86_64                                                                                         
> login          : root                                                                                           
> user           : os                                                                                             
> effective_user : os                                                                                             
> Platform Information
> --------------------------------------------------
> OS.type    : unix
> file.sep   : /
> dynlib.ext : .so
> GUI        : RStudio
> endian     : little
> pkgType    : mac.binary
> path.sep   : :
> r_arch     : 
>            Reporter: Pal
>            Priority: Critical
>
> Hi,
> I have a problem with the schema detected by arrow::open_dataset().
> For example, say I have one Parquet file with two columns (a and b) and another file with three columns (a, b, and c). When I open this dataset, its schema contains only columns a and b. Am I missing something? In the past, I added new columns to some Parquet files that did not exist in other files, and the new columns were automatically added to my schema, which was great.
> The code below reproduces the issue:
>  
> {code:java}
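> library(arrow)
> # First file: two columns (a, b)
> df <- data.frame(a = 1,
>                  b = 2)
> # Second file: three columns (a, b, c)
> df_2 <- data.frame(a = 2,
>                    b = 3,
>                    c = 4)
> write_parquet(df, "C:/Data/test2/df1.parquet")
> write_parquet(df_2, "C:/Data/test2/df2.parquet")
> # Open the directory as a dataset and list the columns in its schema
> ds <- arrow::open_dataset(sources = "C:/Data/test2")
> ds_cols <- data.frame(variables = ds$schema$names)
> ds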
> {code}
>  
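
One possible explanation for the behavior described above, offered as a sketch and not as the confirmed resolution of this issue: open_dataset() accepts a unify_schemas argument, and when the source is a directory path it defaults to inspecting a single fragment for the schema instead of scanning every file. Assuming that is what happens here, passing unify_schemas = TRUE should produce a schema that includes column c. The directory name below is hypothetical and uses tempdir() in place of the path from the report.

```r
library(arrow)

# Hypothetical scratch directory (the report used "C:/Data/test2")
demo_dir <- file.path(tempdir(), "arrow_schema_demo")
dir.create(demo_dir, showWarnings = FALSE)

# Two Parquet files with different column sets, as in the report
write_parquet(data.frame(a = 1, b = 2), file.path(demo_dir, "df1.parquet"))
write_parquet(data.frame(a = 2, b = 3, c = 4), file.path(demo_dir, "df2.parquet"))

# Ask Arrow to scan all fragments and merge their schemas
ds_unified <- open_dataset(demo_dir, unify_schemas = TRUE)

# Column names of the unified schema
unified_cols <- ds_unified$schema$names
print(unified_cols)
```

Scanning every file can be slow for large datasets, so an alternative under the same assumption is to pass an explicit schema through the schema argument of open_dataset().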



--
This message was sent by Atlassian Jira
(v8.20.1#820001)