Posted to jira@arrow.apache.org by "Pal (Jira)" <ji...@apache.org> on 2021/12/01 11:46:00 UTC
[jira] [Created] (ARROW-14939) [R] Problem with new variables in dataset schema

Pal created ARROW-14939:
---------------------------

             Summary: [R] Problem with new variables in dataset schema
                 Key: ARROW-14939
                 URL: https://issues.apache.org/jira/browse/ARROW-14939
             Project: Apache Arrow
          Issue Type: Bug
    Affects Versions: 6.0.1
         Environment: 
RStudio Version
--------------------------------------------------
1.4.1717


Session Information
--------------------------------------------------
R version 4.1.0 (2021-05-18)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS 12.0.1

Matrix products: default
LAPACK: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] arrow_6.0.1

loaded via a namespace (and not attached):
 [1] tidyselect_1.1.1 bit_4.0.4        compiler_4.1.0   magrittr_2.0.1   assertthat_0.2.1 R6_2.5.1        
 [7] tools_4.1.0      glue_1.5.0       bit64_4.0.5      vctrs_0.3.8      rlang_0.4.12     purrr_0.3.4     


System Information
--------------------------------------------------
sysname        : Darwin                                                                                         
release        : 21.1.0                                                                                         
version        : Darwin Kernel Version 21.1.0: Wed Oct 13 17:33:23 PDT 2021; root:xnu-8019.41.5~1/RELEASE_X86_64
nodename       :                                                                    
machine        : x86_64                                                                                         
login          : root                                                                                           
user           : os                                                                                             
effective_user : os                                                                                             


Platform Information
--------------------------------------------------
OS.type    : unix
file.sep   : /
dynlib.ext : .so
GUI        : RStudio
endian     : little
pkgType    : mac.binary
path.sep   : :
r_arch     : 
            Reporter: Pal


Hi, 

I have a problem with updating the schema in arrow::open_dataset().

For example, say I have one Parquet file with two columns (a and b) and another file with three columns (a, b and c). When I open the directory as a dataset, its schema only contains columns a and b. Am I missing something? In the past, I added new columns to some Parquet files that did not exist in the other files, and the new columns were automatically added to the schema, which was great.

Here is the code to reproduce the issue:

 
{code:java}
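library(arrow)

# First file has two columns (a, b)
df <- data.frame(a = 1, b = 2)
# Second file adds a third column (c)
df_2 <- data.frame(a = 2, b = 3, c = 4)

write_parquet(df, "C:/Data/test2/df1.parquet")
write_parquet(df_2, "C:/Data/test2/df2.parquet")

# Open both files as one dataset and list the columns in its schema
ds <- arrow::open_dataset(sources = "C:/Data/test2")
ds_cols <- data.frame(variables = ds$schema$names)
ds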
{code}
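
For comparison, here is a minimal sketch of two ways one might ask for column c explicitly. It is not a confirmed fix. It assumes that open_dataset() accepts a schema argument and a unify_schemas argument as described in the arrow R package documentation, and that columns missing from a file are read as NA. The temporary directory and the variable names (data_dir, full_schema, ds_explicit, ds_unified) are made up for illustration.

```r
library(arrow)

# Hypothetical scratch directory, used so the sketch does not depend on C:/Data
data_dir <- file.path(tempdir(), "arrow_schema_demo")
dir.create(data_dir, showWarnings = FALSE)

write_parquet(data.frame(a = 1, b = 2), file.path(data_dir, "df1.parquet"))
write_parquet(data.frame(a = 2, b = 3, c = 4), file.path(data_dir, "df2.parquet"))

# Option 1 (assumption): pass the full schema explicitly, so the result
# does not depend on which file is inspected first
full_schema <- schema(a = float64(), b = float64(), c = float64())
ds_explicit <- open_dataset(data_dir, schema = full_schema)
ds_explicit$schema$names

# Option 2 (assumption): ask arrow to inspect every file and merge the schemas
ds_unified <- open_dataset(data_dir, unify_schemas = TRUE)
ds_unified$schema$names
```

If these arguments behave as assumed, both schemas should list a, b and c. Inspecting every file may be slow on a dataset with many files, which could explain why only one file is looked at by default.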
 


