Posted to jira@arrow.apache.org by "Jared Lander (Jira)" <ji...@apache.org> on 2021/09/21 21:55:00 UTC
[jira] [Created] (ARROW-14063) open_dataset() does not work on CSVs without header rows

Jared Lander created ARROW-14063:
------------------------------------

             Summary: open_dataset() does not work on CSVs without header rows
                 Key: ARROW-14063
                 URL: https://issues.apache.org/jira/browse/ARROW-14063
             Project: Apache Arrow
          Issue Type: Bug
          Components: R
    Affects Versions: 5.0.0
         Environment: sessionInfo()
R version 4.0.5 (2021-03-31)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 18.04.5 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/openblas/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/libopenblasp-r0.2.20.so

locale:
 [1] LC_CTYPE=C.UTF-8       LC_NUMERIC=C           LC_TIME=C.UTF-8       
 [4] LC_COLLATE=C.UTF-8     LC_MONETARY=C.UTF-8    LC_MESSAGES=C.UTF-8   
 [7] LC_PAPER=C.UTF-8       LC_NAME=C              LC_ADDRESS=C          
[10] LC_TELEPHONE=C         LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C   

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] arrow_5.0.0.2  dplyr_1.0.5    magrittr_2.0.1 targets_0.6.0 

loaded via a namespace (and not attached):
 [1] httr_1.4.2          rnaturalearth_0.1.0 sass_0.4.0          tidyr_1.1.3        
 [5] jsonlite_1.7.2      bit64_4.0.5         bslib_0.2.5.1       assertthat_0.2.1   
 [9] askpass_1.1         sp_1.4-5            blob_1.2.1          renv_0.13.2        
[13] yaml_2.2.1          globals_0.14.0      pillar_1.5.1        RSQLite_2.2.7      
[17] lattice_0.20-41     glue_1.4.2          digest_0.6.27       htmltools_0.5.1.1  
[21] pkgconfig_2.0.3     RPostgres_1.3.2     listenv_0.8.0       config_0.3.1       
[25] purrr_0.3.4         processx_3.5.1      openssl_1.4.3       tibble_3.1.0       
[29] proxy_0.4-25        aws.s3_0.3.21       colourvalues_0.3.7  generics_0.1.0     
[33] ellipsis_0.3.1      cachem_1.0.5        withr_2.4.1         furrr_0.2.3        
[37] cli_2.4.0           crayon_1.4.1        memoise_2.0.0       evaluate_0.14      
[41] ps_1.6.0            fs_1.5.0            future_1.21.0       fansi_0.4.2        
[45] parallelly_1.25.0   xml2_1.3.2          class_7.3-18        rsconnect_0.8.18   
[49] tools_4.0.5         data.table_1.14.0   hms_1.0.0           lifecycle_1.0.0    
[53] stringr_1.4.0       callr_3.6.0         jquerylib_0.1.4     compiler_4.0.5     
[57] e1071_1.7-6         rlang_0.4.10        classInt_0.4-3      units_0.7-1        
[61] grid_4.0.5          rstudioapi_0.13     visNetwork_2.0.9    htmlwidgets_1.5.3  
[65] aws.signature_0.6.0 crosstalk_1.1.1     igraph_1.2.6        base64enc_0.1-3    
[69] rmarkdown_2.7       codetools_0.2-18    DBI_1.1.1           curl_4.3           
[73] R6_2.5.0            lubridate_1.7.10    knitr_1.31          fastmap_1.1.0      
[77] rgeos_0.5-5         bit_4.0.4           utf8_1.2.1          tarchetypes_0.2.1  
[81] readr_1.4.0         KernSmooth_2.23-18  stringi_1.5.3       parallel_4.0.5     
[85] Rcpp_1.0.6          vctrs_0.3.7         sf_0.9-8            leaflet_2.0.4.1    
[89] dbplyr_2.1.1        tidyselect_1.1.0    xfun_0.22
            Reporter: Jared Lander


Using {{open_dataset()}} on a CSV without a header row, followed by {{collect()}}, results in either a {{tibble}} of {{NA}}s or an error, depending on whether the first row of data contains duplicate values. This affects reading a single file as well as a directory of files.

Here we use the {{diamonds}} data, where the first row of data does not have any repeated values.
{code:java}
library(arrow)
library(magrittr)

data(diamonds, package='ggplot2')

readr::write_csv(head(diamonds), file='diamonds_with_header.csv', col_names=TRUE)
readr::write_csv(head(diamonds), file='diamonds_without_header.csv', col_names=FALSE)

diamond_schema <- schema(
    carat=float32()
    , cut=string()
    , color=string()
    , clarity=string()
    , depth=float32()
    , table=float32()
    , price=float32()
    , x=float32()
    , y=float32()
    , z=float32()
)

diamonds_with_headers <- open_dataset('diamonds_with_header.csv', schema=diamond_schema, format='csv')
diamonds_without_headers <- open_dataset('diamonds_without_header.csv', schema=diamond_schema, format='csv')

# this works
diamonds_with_headers %>% collect()
# A tibble: 6 x 10
  carat cut       color clarity depth table price     x     y     z
  <dbl> <chr>     <chr> <chr>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 0.230 Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43
2 0.210 Premium   E     SI1      59.8    61   326  3.89  3.84  2.31
3 0.230 Good      E     VS1      56.9    65   327  4.05  4.07  2.31
4 0.290 Premium   I     VS2      62.4    58   334  4.20  4.23  2.63
5 0.310 Good      J     SI2      63.3    58   335  4.34  4.35  2.75
6 0.240 Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48

# this gives a tibble with all NA values, though of the correct types
diamonds_without_headers %>% collect()
# A tibble: 5 x 10
  carat cut   color clarity depth table price     x     y     z
  <dbl> <chr> <chr> <chr>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1    NA NA    NA    NA         NA    NA    NA    NA    NA    NA
2    NA NA    NA    NA         NA    NA    NA    NA    NA    NA
3    NA NA    NA    NA         NA    NA    NA    NA    NA    NA
4    NA NA    NA    NA         NA    NA    NA    NA    NA    NA
5    NA NA    NA    NA         NA    NA    NA    NA    NA    NA
{code}
Now we use a simple dataset where two of the columns in the first row have the same value, 0.0.

 
{code:java}
randomDF <- tibble::tibble(
    A=c(0.0, 2.3, 5.1)
    , B=c('a', 'b', 'a')
    , C=c(0.0, 3.1, 4.5)
)

readr::write_csv(randomDF, file='random_with_header.csv', col_names=TRUE)
readr::write_csv(randomDF, file='random_without_header.csv', col_names=FALSE)

random_schema <- schema(
    A=float32()
    , B=string()
    , C=float32()
)

random_with_headers <- open_dataset('random_with_header.csv', schema=random_schema, format='csv')
random_without_headers <- open_dataset('random_without_header.csv', schema=random_schema, format='csv')

# gives a tibble with the proper values
random_with_headers %>% collect()
# A tibble: 3 x 3
      A B         C
  <dbl> <chr> <dbl>
1  0    a      0   
2  2.30 b      3.10
3  5.10 a      4.5 

# results in an error
random_without_headers %>% collect()
Error: Invalid: Could not open CSV input source '/home/jared.lander@wabisabi.isso.net/projects/lethaldrifter/ld_etl/without_header.csv': Invalid: CSV file contained multiple columns named 0
{code}
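The error message is consistent with the first data row being treated as the header row. The base R sketch below only illustrates that reading of the error; it is an assumption about the cause, not a confirmed description of Arrow's internals. The file contents are typed in by hand to match what {{readr::write_csv()}} writes for {{randomDF}} without column names.

```r
# Hand-written stand-in for random_without_header.csv
csv_path <- tempfile(fileext = ".csv")
writeLines(c("0,a,0", "2.3,b,3.1", "5.1,a,4.5"), csv_path)

# If the first line were taken as the header, these would be the column names
first_line <- readLines(csv_path, n = 1)
header <- strsplit(first_line, ",", fixed = TRUE)[[1]]

# anyDuplicated() returns the index of the first repeated name, or 0 if none
has_duplicates <- anyDuplicated(header) > 0
print(header)
print(has_duplicates)
```

Under the same assumption, the {{diamonds}} file without a header would lose its first row to the header, which would match the 5-row result above.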
Interestingly, {{read_csv_arrow()}} has the opposite problem. Providing the schema works for CSVs without headers but not for CSVs with headers, despite the help file saying that providing a schema satisfies both {{col_names}} and {{col_types}}.

 
{code:java}
diamonds_read_with_header <- read_csv_arrow('diamonds_with_header.csv', schema=diamond_schema)
Error: Invalid: In CSV column #0: CSV conversion error to float: invalid value 'carat'

diamonds_read_without_header <- read_csv_arrow('diamonds_without_header.csv', schema=diamond_schema)
# reads normally


random_read_with_header <- read_csv_arrow('random_with_header.csv', schema=random_schema)
Error: Invalid: In CSV column #0: CSV conversion error to float: invalid value 'A'

random_read_without_header <- read_csv_arrow('random_without_header.csv', schema=random_schema)
# reads normally
{code}
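A possible workaround for the {{read_csv_arrow()}} case with a header row is to skip the header line explicitly with the {{skip}} argument. The sketch below assumes {{skip}} is honored together with {{schema}}; it uses a small data frame written with base R rather than the files above, and the names {{random_df}}, {{csv_path}} and {{random_read}} are only illustrative. It does not address the {{open_dataset()}} behavior.

```r
library(arrow)

# Small stand-in for the random data used in this report
random_df <- data.frame(
    A = c(0.0, 2.3, 5.1),
    B = c("a", "b", "a"),
    C = c(0.0, 3.1, 4.5)
)

# Write a CSV that includes a header row
csv_path <- tempfile(fileext = ".csv")
write.csv(random_df, csv_path, row.names = FALSE)

random_schema <- schema(
    A = float32(),
    B = string(),
    C = float32()
)

# skip = 1 drops the header line so it is not parsed as a row of data
random_read <- read_csv_arrow(csv_path, schema = random_schema, skip = 1)
print(random_read)
```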



--
This message was sent by Atlassian Jira
(v8.3.4#803005)