Posted to jira@arrow.apache.org by "Jared Lander (Jira)" <ji...@apache.org> on 2022/03/04 04:17:00 UTC
[jira] [Commented] (ARROW-14063) [R] open_dataset() does not work on CSVs without header rows
[ https://issues.apache.org/jira/browse/ARROW-14063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17501145#comment-17501145 ]
Jared Lander commented on ARROW-14063:
--------------------------------------
I know this is marked as resolved, but I just tried with Arrow 7.0. If I want to use open_dataset() on CSVs with header rows and specify the schema (which I have to, because the types are guessed incorrectly), then I have to set skip_rows=1. That seems not awesome, especially for someone who doesn't know about this issue. So I just wanted to put a note here that this is still an open issue.
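For anyone landing here, a minimal sketch of the workaround described above. It assumes Arrow 7.0 behavior as reported in this comment, where open_dataset() accepts skip_rows for CSV input; newer releases may behave differently. The file name and the small data frame are made up for illustration.

```r
library(arrow)

# Hypothetical sample data written to a temporary CSV that HAS a header row.
df <- data.frame(
  A = c(0.0, 2.3, 5.1),
  B = c("a", "b", "a"),
  C = c(0.0, 3.1, 4.5)
)
csv_path <- file.path(tempdir(), "random_with_header.csv")
write.csv(df, csv_path, row.names = FALSE, quote = FALSE)

# Explicit schema, so the column types are not guessed.
random_schema <- schema(
  A = float64(),
  B = string(),
  C = float64()
)

# Because the schema already supplies the column names, the header line
# has to be skipped explicitly (the workaround reported in this comment);
# otherwise it is treated as a row of data.
ds <- open_dataset(
  csv_path,
  schema = random_schema,
  format = "csv",
  skip_rows = 1
)

result <- dplyr::collect(ds)
print(result)
```

Without skip_rows=1 the header line is read as data, which is the surprising part for anyone supplying a schema for a file that has headers.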
> [R] open_dataset() does not work on CSVs without header rows
> ------------------------------------------------------------
>
> Key: ARROW-14063
> URL: https://issues.apache.org/jira/browse/ARROW-14063
> Project: Apache Arrow
> Issue Type: Bug
> Components: R
> Affects Versions: 5.0.0
> Environment: sessionInfo()
> R version 4.0.5 (2021-03-31)
> Platform: x86_64-pc-linux-gnu (64-bit)
> Running under: Ubuntu 18.04.5 LTS
> Matrix products: default
> BLAS: /usr/lib/x86_64-linux-gnu/openblas/libblas.so.3
> LAPACK: /usr/lib/x86_64-linux-gnu/libopenblasp-r0.2.20.so
> locale:
> [1] LC_CTYPE=C.UTF-8 LC_NUMERIC=C LC_TIME=C.UTF-8
> [4] LC_COLLATE=C.UTF-8 LC_MONETARY=C.UTF-8 LC_MESSAGES=C.UTF-8
> [7] LC_PAPER=C.UTF-8 LC_NAME=C LC_ADDRESS=C
> [10] LC_TELEPHONE=C LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C
> attached base packages:
> [1] stats graphics grDevices utils datasets methods base
> other attached packages:
> [1] arrow_5.0.0.2 dplyr_1.0.5 magrittr_2.0.1 targets_0.6.0
> loaded via a namespace (and not attached):
> [1] httr_1.4.2 rnaturalearth_0.1.0 sass_0.4.0 tidyr_1.1.3
> [5] jsonlite_1.7.2 bit64_4.0.5 bslib_0.2.5.1 assertthat_0.2.1
> [9] askpass_1.1 sp_1.4-5 blob_1.2.1 renv_0.13.2
> [13] yaml_2.2.1 globals_0.14.0 pillar_1.5.1 RSQLite_2.2.7
> [17] lattice_0.20-41 glue_1.4.2 digest_0.6.27 htmltools_0.5.1.1
> [21] pkgconfig_2.0.3 RPostgres_1.3.2 listenv_0.8.0 config_0.3.1
> [25] purrr_0.3.4 processx_3.5.1 openssl_1.4.3 tibble_3.1.0
> [29] proxy_0.4-25 aws.s3_0.3.21 colourvalues_0.3.7 generics_0.1.0
> [33] ellipsis_0.3.1 cachem_1.0.5 withr_2.4.1 furrr_0.2.3
> [37] cli_2.4.0 crayon_1.4.1 memoise_2.0.0 evaluate_0.14
> [41] ps_1.6.0 fs_1.5.0 future_1.21.0 fansi_0.4.2
> [45] parallelly_1.25.0 xml2_1.3.2 class_7.3-18 rsconnect_0.8.18
> [49] tools_4.0.5 data.table_1.14.0 hms_1.0.0 lifecycle_1.0.0
> [53] stringr_1.4.0 callr_3.6.0 jquerylib_0.1.4 compiler_4.0.5
> [57] e1071_1.7-6 rlang_0.4.10 classInt_0.4-3 units_0.7-1
> [61] grid_4.0.5 rstudioapi_0.13 visNetwork_2.0.9 htmlwidgets_1.5.3
> [65] aws.signature_0.6.0 crosstalk_1.1.1 igraph_1.2.6 base64enc_0.1-3
> [69] rmarkdown_2.7 codetools_0.2-18 DBI_1.1.1 curl_4.3
> [73] R6_2.5.0 lubridate_1.7.10 knitr_1.31 fastmap_1.1.0
> [77] rgeos_0.5-5 bit_4.0.4 utf8_1.2.1 tarchetypes_0.2.1
> [81] readr_1.4.0 KernSmooth_2.23-18 stringi_1.5.3 parallel_4.0.5
> [85] Rcpp_1.0.6 vctrs_0.3.7 sf_0.9-8 leaflet_2.0.4.1
> [89] dbplyr_2.1.1 tidyselect_1.1.0 xfun_0.22
> Reporter: Jared Lander
> Assignee: Nicola Crane
> Priority: Major
> Labels: bug, pull-request-available
> Fix For: 6.0.0
>
> Time Spent: 2.5h
> Remaining Estimate: 0h
>
> Using {{open_dataset()}} on a CSV without a header row, followed by {{collect()}}, results either in a {{tibble}} of {{NA}}s or an error, depending on whether the first row of data contains duplicate values. This affects reading one file or a directory of files.
> Here we use the `diamonds` data, where the first row of data does not have any repeated values.
> {code:java}
> library(arrow)
> library(magrittr)
> data(diamonds, package='ggplot2')
> readr::write_csv(head(diamonds), file='diamonds_with_header.csv', col_names=TRUE)
> readr::write_csv(head(diamonds), file='diamonds_without_header.csv', col_names=FALSE)
> diamond_schema <- schema(
> carat=float32()
> , cut=string()
> , color=string()
> , clarity=string()
> , depth=float32()
> , table=float32()
> , price=float32()
> , x=float32()
> , y=float32()
> , z=float32()
> )
> diamonds_with_headers <- open_dataset('diamonds_with_header.csv', schema=diamond_schema, format='csv')
> diamonds_without_headers <- open_dataset('diamonds_without_header.csv', schema=diamond_schema, format='csv')
> # this works
> diamonds_with_headers %>% collect()
> # A tibble: 6 x 10
> carat cut color clarity depth table price x y z
> <dbl> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
> 1 0.230 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
> 2 0.210 Premium E SI1 59.8 61 326 3.89 3.84 2.31
> 3 0.230 Good E VS1 56.9 65 327 4.05 4.07 2.31
> 4 0.290 Premium I VS2 62.4 58 334 4.20 4.23 2.63
> 5 0.310 Good J SI2 63.3 58 335 4.34 4.35 2.75
> 6 0.240 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
> # this gives a tibble with all NA values, though of the correct types
> diamonds_without_headers %>% collect()
> # A tibble: 5 x 10
> carat cut color clarity depth table price x y z
> <dbl> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
> 1 NA NA NA NA NA NA NA NA NA NA
> 2 NA NA NA NA NA NA NA NA NA NA
> 3 NA NA NA NA NA NA NA NA NA NA
> 4 NA NA NA NA NA NA NA NA NA NA
> 5 NA NA NA NA NA NA NA NA NA NA
> {code}
> Now we use a simple dataset where two of the columns in the first row have the same value, 0.0.
>
> {code:java}
> randomDF <- tibble::tibble(
> A=c(0.0, 2.3, 5.1)
> , B=c('a', 'b', 'a')
> , C=c(0.0, 3.1, 4.5)
> )
> readr::write_csv(randomDF, file='random_with_header.csv', col_names=TRUE)
> readr::write_csv(randomDF, file='random_without_header.csv', col_names=FALSE)
> random_schema <- schema(
> A=float32()
> , B=string()
> , C=float32()
> )
> random_with_headers <- open_dataset('random_with_header.csv', schema=random_schema, format='csv')
> random_without_headers <- open_dataset('random_without_header.csv', schema=random_schema, format='csv')
> # gives a tibble with the proper values
> random_with_headers %>% collect()
> # A tibble: 3 x 3
> A B C
> <dbl> <chr> <dbl>
> 1 0 a 0
> 2 2.30 b 3.10
> 3 5.10 a 4.5
> # results in an error
> random_without_headers %>% collect()
> Error: Invalid: Could not open CSV input source 'without_header.csv': Invalid: CSV file contained multiple columns named 0
> {code}
> Interestingly, {{read_csv_arrow()}} has the opposite problem. Providing the schema works for CSVs without headers, but not for CSVs with headers, despite the help file saying that providing a schema satisfies both {{col_names}} and {{col_types}}.
>
> {code:java}
> diamonds_read_with_header <- read_csv_arrow('diamonds_with_header.csv', schema=diamond_schema)
> Error: Invalid: In CSV column #0: CSV conversion error to float: invalid value 'carat'
> diamonds_read_without_header <- read_csv_arrow('diamonds_without_header.csv', schema=diamond_schema)
> # reads normally
> random_read_with_header <- read_csv_arrow('random_with_header.csv', schema=random_schema)
> Error: Invalid: In CSV column #0: CSV conversion error to float: invalid value 'A'
> random_read_without_header <- read_csv_arrow('random_without_header.csv', schema=random_schema)
> # reads normally{code}