Posted to jira@arrow.apache.org by "Jared Lander (Jira)" <ji...@apache.org> on 2022/03/04 04:17:00 UTC
[jira] [Commented] (ARROW-14063) [R] open_dataset() does not work on CSVs without header rows

    [ https://issues.apache.org/jira/browse/ARROW-14063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17501145#comment-17501145 ] 

Jared Lander commented on ARROW-14063:
--------------------------------------

I know this is marked as resolved, but I just tried with Arrow 7.0. If I want to use open_dataset() on CSVs with header rows and specify the schema (which I have to, because the types are guessed incorrectly), I have to set skip_rows=1. That seems not awesome, especially for someone who doesn't know about this issue. So I just wanted to put a note here that this is still an open issue.
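
A minimal sketch of that workaround, assuming Arrow behaves as described in this comment. The file name and data are made up for illustration, and skip_rows=1 is the argument named above; other releases may expect a different spelling.

```r
library(arrow)

# Write a small CSV that includes a header row (illustrative file and data).
csv_path <- file.path(tempdir(), "random_with_header.csv")
write.csv(
    data.frame(A = c(0.0, 2.3, 5.1), B = c("a", "b", "a"), C = c(0.0, 3.1, 4.5)),
    csv_path,
    row.names = FALSE
)

random_schema <- schema(A = float32(), B = string(), C = float32())

# With an explicit schema the header line is treated as data, so
# skip_rows = 1 (the workaround from this comment) drops it before parsing.
random_ds <- open_dataset(csv_path, schema = random_schema, format = "csv", skip_rows = 1)
random_df <- dplyr::collect(random_ds)
```

If the skip is left out, the header line gets parsed against the schema's types, which is the surprising part for anyone who has not read this issue.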

> [R] open_dataset() does not work on CSVs without header rows
> ------------------------------------------------------------
>
>                 Key: ARROW-14063
>                 URL: https://issues.apache.org/jira/browse/ARROW-14063
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: R
>    Affects Versions: 5.0.0
>         Environment: sessionInfo()
> R version 4.0.5 (2021-03-31)
> Platform: x86_64-pc-linux-gnu (64-bit)
> Running under: Ubuntu 18.04.5 LTS
> Matrix products: default
> BLAS:   /usr/lib/x86_64-linux-gnu/openblas/libblas.so.3
> LAPACK: /usr/lib/x86_64-linux-gnu/libopenblasp-r0.2.20.so
> locale:
>  [1] LC_CTYPE=C.UTF-8       LC_NUMERIC=C           LC_TIME=C.UTF-8       
>  [4] LC_COLLATE=C.UTF-8     LC_MONETARY=C.UTF-8    LC_MESSAGES=C.UTF-8   
>  [7] LC_PAPER=C.UTF-8       LC_NAME=C              LC_ADDRESS=C          
> [10] LC_TELEPHONE=C         LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C   
> attached base packages:
> [1] stats     graphics  grDevices utils     datasets  methods   base     
> other attached packages:
> [1] arrow_5.0.0.2  dplyr_1.0.5    magrittr_2.0.1 targets_0.6.0 
> loaded via a namespace (and not attached):
>  [1] httr_1.4.2          rnaturalearth_0.1.0 sass_0.4.0          tidyr_1.1.3        
>  [5] jsonlite_1.7.2      bit64_4.0.5         bslib_0.2.5.1       assertthat_0.2.1   
>  [9] askpass_1.1         sp_1.4-5            blob_1.2.1          renv_0.13.2        
> [13] yaml_2.2.1          globals_0.14.0      pillar_1.5.1        RSQLite_2.2.7      
> [17] lattice_0.20-41     glue_1.4.2          digest_0.6.27       htmltools_0.5.1.1  
> [21] pkgconfig_2.0.3     RPostgres_1.3.2     listenv_0.8.0       config_0.3.1       
> [25] purrr_0.3.4         processx_3.5.1      openssl_1.4.3       tibble_3.1.0       
> [29] proxy_0.4-25        aws.s3_0.3.21       colourvalues_0.3.7  generics_0.1.0     
> [33] ellipsis_0.3.1      cachem_1.0.5        withr_2.4.1         furrr_0.2.3        
> [37] cli_2.4.0           crayon_1.4.1        memoise_2.0.0       evaluate_0.14      
> [41] ps_1.6.0            fs_1.5.0            future_1.21.0       fansi_0.4.2        
> [45] parallelly_1.25.0   xml2_1.3.2          class_7.3-18        rsconnect_0.8.18   
> [49] tools_4.0.5         data.table_1.14.0   hms_1.0.0           lifecycle_1.0.0    
> [53] stringr_1.4.0       callr_3.6.0         jquerylib_0.1.4     compiler_4.0.5     
> [57] e1071_1.7-6         rlang_0.4.10        classInt_0.4-3      units_0.7-1        
> [61] grid_4.0.5          rstudioapi_0.13     visNetwork_2.0.9    htmlwidgets_1.5.3  
> [65] aws.signature_0.6.0 crosstalk_1.1.1     igraph_1.2.6        base64enc_0.1-3    
> [69] rmarkdown_2.7       codetools_0.2-18    DBI_1.1.1           curl_4.3           
> [73] R6_2.5.0            lubridate_1.7.10    knitr_1.31          fastmap_1.1.0      
> [77] rgeos_0.5-5         bit_4.0.4           utf8_1.2.1          tarchetypes_0.2.1  
> [81] readr_1.4.0         KernSmooth_2.23-18  stringi_1.5.3       parallel_4.0.5     
> [85] Rcpp_1.0.6          vctrs_0.3.7         sf_0.9-8            leaflet_2.0.4.1    
> [89] dbplyr_2.1.1        tidyselect_1.1.0    xfun_0.22
>            Reporter: Jared Lander
>            Assignee: Nicola Crane
>            Priority: Major
>              Labels: bug, pull-request-available
>             Fix For: 6.0.0
>
>          Time Spent: 2.5h
>  Remaining Estimate: 0h
>
> Using {{open_dataset()}} on a CSV without a header row, followed by {{collect()}}, results in either a {{tibble}} of {{NA}}s or an error, depending on whether the first row of data contains duplicate values. This affects reading a single file as well as a directory of files.
> Here we use the {{diamonds}} data, where the first row of data does not have any repeated values.
> {code:java}
> library(arrow)
> library(magrittr)
> data(diamonds, package='ggplot2')
> readr::write_csv(head(diamonds), file='diamonds_with_header.csv', col_names=TRUE)
> readr::write_csv(head(diamonds), file='diamonds_without_header.csv', col_names=FALSE)
> diamond_schema <- schema(
>     carat=float32()
>     , cut=string()
>     , color=string()
>     , clarity=string()
>     , depth=float32()
>     , table=float32()
>     , price=float32()
>     , x=float32()
>     , y=float32()
>     , z=float32()
> )
> diamonds_with_headers <- open_dataset('diamonds_with_header.csv', schema=diamond_schema, format='csv')
> diamonds_without_headers <- open_dataset('diamonds_without_header.csv', schema=diamond_schema, format='csv')
> # this works
> diamonds_with_headers %>% collect()
> # A tibble: 6 x 10
>   carat cut       color clarity depth table price     x     y     z
>   <dbl> <chr>     <chr> <chr>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
> 1 0.230 Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43
> 2 0.210 Premium   E     SI1      59.8    61   326  3.89  3.84  2.31
> 3 0.230 Good      E     VS1      56.9    65   327  4.05  4.07  2.31
> 4 0.290 Premium   I     VS2      62.4    58   334  4.20  4.23  2.63
> 5 0.310 Good      J     SI2      63.3    58   335  4.34  4.35  2.75
> 6 0.240 Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48
> # this gives a tibble with all NA values, though of the correct types
> diamonds_without_headers %>% collect()
> # A tibble: 5 x 10
>   carat cut   color clarity depth table price     x     y     z
>   <dbl> <chr> <chr> <chr>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
> 1    NA NA    NA    NA         NA    NA    NA    NA    NA    NA
> 2    NA NA    NA    NA         NA    NA    NA    NA    NA    NA
> 3    NA NA    NA    NA         NA    NA    NA    NA    NA    NA
> 4    NA NA    NA    NA         NA    NA    NA    NA    NA    NA
> 5    NA NA    NA    NA         NA    NA    NA    NA    NA    NA
> {code}
> Now we use a simple dataset where two of the columns in the first row have the same value, 0.0.
>  
> {code:java}
> randomDF <- tibble::tibble(
>     A=c(0.0, 2.3, 5.1)
>     , B=c('a', 'b', 'a')
>     , C=c(0.0, 3.1, 4.5)
> )
> readr::write_csv(randomDF, file='random_with_header.csv', col_names=TRUE)
> readr::write_csv(randomDF, file='random_without_header.csv', col_names=FALSE)
> random_schema <- schema(
>     A=float32()
>     , B=string()
>     , C=float32()
> )
> random_with_headers <- open_dataset('random_with_header.csv', schema=random_schema, format='csv')
> random_without_headers <- open_dataset('random_without_header.csv', schema=random_schema, format='csv')
> # gives a tibble with the proper values
> random_with_headers %>% collect()
> # A tibble: 3 x 3
>       A B         C
>   <dbl> <chr> <dbl>
> 1  0    a      0   
> 2  2.30 b      3.10
> 3  5.10 a      4.5 
> # results in an error
> random_without_headers %>% collect()
> Error: Invalid: Could not open CSV input source 'without_header.csv': Invalid: CSV file contained multiple columns named 0
> {code}
> Interestingly, {{read_csv_arrow()}} has the opposite problem. Providing the schema works for CSVs without headers, but not for CSVs with headers, despite the help file saying that providing a schema satisfies both {{col_names}} and {{col_types}}.
>  
> {code:java}
> diamonds_read_with_header <- read_csv_arrow('diamonds_with_header.csv', schema=diamond_schema)
> Error: Invalid: In CSV column #0: CSV conversion error to float: invalid value 'carat'
> diamonds_read_without_header <- read_csv_arrow('diamonds_without_header.csv', schema=diamond_schema)
> # reads normally
> random_read_with_header <- read_csv_arrow('random_with_header.csv', schema=random_schema)
> Error: Invalid: In CSV column #0: CSV conversion error to float: invalid value 'A'
> random_read_without_header <- read_csv_arrow('random_without_header.csv', schema=random_schema)
> # reads normally
> {code}


