Posted to jira@arrow.apache.org by "Jared Lander (Jira)" <ji...@apache.org> on 2021/09/21 21:55:00 UTC
[jira] [Created] (ARROW-14063) open_dataset() does not work on CSVs without header rows
Jared Lander created ARROW-14063:
------------------------------------
Summary: open_dataset() does not work on CSVs without header rows
Key: ARROW-14063
URL: https://issues.apache.org/jira/browse/ARROW-14063
Project: Apache Arrow
Issue Type: Bug
Components: R
Affects Versions: 5.0.0
Environment: sessionInfo()
R version 4.0.5 (2021-03-31)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 18.04.5 LTS
Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/openblas/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/libopenblasp-r0.2.20.so
locale:
[1] LC_CTYPE=C.UTF-8 LC_NUMERIC=C LC_TIME=C.UTF-8
[4] LC_COLLATE=C.UTF-8 LC_MONETARY=C.UTF-8 LC_MESSAGES=C.UTF-8
[7] LC_PAPER=C.UTF-8 LC_NAME=C LC_ADDRESS=C
[10] LC_TELEPHONE=C LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] arrow_5.0.0.2 dplyr_1.0.5 magrittr_2.0.1 targets_0.6.0
loaded via a namespace (and not attached):
[1] httr_1.4.2 rnaturalearth_0.1.0 sass_0.4.0 tidyr_1.1.3
[5] jsonlite_1.7.2 bit64_4.0.5 bslib_0.2.5.1 assertthat_0.2.1
[9] askpass_1.1 sp_1.4-5 blob_1.2.1 renv_0.13.2
[13] yaml_2.2.1 globals_0.14.0 pillar_1.5.1 RSQLite_2.2.7
[17] lattice_0.20-41 glue_1.4.2 digest_0.6.27 htmltools_0.5.1.1
[21] pkgconfig_2.0.3 RPostgres_1.3.2 listenv_0.8.0 config_0.3.1
[25] purrr_0.3.4 processx_3.5.1 openssl_1.4.3 tibble_3.1.0
[29] proxy_0.4-25 aws.s3_0.3.21 colourvalues_0.3.7 generics_0.1.0
[33] ellipsis_0.3.1 cachem_1.0.5 withr_2.4.1 furrr_0.2.3
[37] cli_2.4.0 crayon_1.4.1 memoise_2.0.0 evaluate_0.14
[41] ps_1.6.0 fs_1.5.0 future_1.21.0 fansi_0.4.2
[45] parallelly_1.25.0 xml2_1.3.2 class_7.3-18 rsconnect_0.8.18
[49] tools_4.0.5 data.table_1.14.0 hms_1.0.0 lifecycle_1.0.0
[53] stringr_1.4.0 callr_3.6.0 jquerylib_0.1.4 compiler_4.0.5
[57] e1071_1.7-6 rlang_0.4.10 classInt_0.4-3 units_0.7-1
[61] grid_4.0.5 rstudioapi_0.13 visNetwork_2.0.9 htmlwidgets_1.5.3
[65] aws.signature_0.6.0 crosstalk_1.1.1 igraph_1.2.6 base64enc_0.1-3
[69] rmarkdown_2.7 codetools_0.2-18 DBI_1.1.1 curl_4.3
[73] R6_2.5.0 lubridate_1.7.10 knitr_1.31 fastmap_1.1.0
[77] rgeos_0.5-5 bit_4.0.4 utf8_1.2.1 tarchetypes_0.2.1
[81] readr_1.4.0 KernSmooth_2.23-18 stringi_1.5.3 parallel_4.0.5
[85] Rcpp_1.0.6 vctrs_0.3.7 sf_0.9-8 leaflet_2.0.4.1
[89] dbplyr_2.1.1 tidyselect_1.1.0 xfun_0.22
Reporter: Jared Lander
Using {{open_dataset()}} on a CSV without a header row, followed by {{collect()}}, results either in a {{tibble}} of {{NA}}s or an error depending on duplication of the first row of data. This affects reading one file or a directory of files.
Here we use the {{diamonds}} data, where the first row of data does not contain any repeated values.
{code:java}
library(arrow)
library(magrittr)
library(dplyr) # provides collect()
data(diamonds, package='ggplot2')
readr::write_csv(head(diamonds), file='diamonds_with_header.csv', col_names=TRUE)
readr::write_csv(head(diamonds), file='diamonds_without_header.csv', col_names=FALSE)
diamond_schema <- schema(
carat=float32()
, cut=string()
, color=string()
, clarity=string()
, depth=float32()
, table=float32()
, price=float32()
, x=float32()
, y=float32()
, z=float32()
)
diamonds_with_headers <- open_dataset('diamonds_with_header.csv', schema=diamond_schema, format='csv')
diamonds_without_headers <- open_dataset('diamonds_without_header.csv', schema=diamond_schema, format='csv')
# this works
diamonds_with_headers %>% collect()
# A tibble: 6 x 10
carat cut color clarity depth table price x y z
<dbl> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 0.230 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
2 0.210 Premium E SI1 59.8 61 326 3.89 3.84 2.31
3 0.230 Good E VS1 56.9 65 327 4.05 4.07 2.31
4 0.290 Premium I VS2 62.4 58 334 4.20 4.23 2.63
5 0.310 Good J SI2 63.3 58 335 4.34 4.35 2.75
6 0.240 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
# this gives a tibble with all NA values, though of the correct types
diamonds_without_headers %>% collect()
# A tibble: 5 x 10
carat cut color clarity depth table price x y z
<dbl> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 NA NA NA NA NA NA NA NA NA NA
2 NA NA NA NA NA NA NA NA NA NA
3 NA NA NA NA NA NA NA NA NA NA
4 NA NA NA NA NA NA NA NA NA NA
5 NA NA NA NA NA NA NA NA NA NA
{code}
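Six rows were written but only five come back, which is consistent with the first data row being consumed as a header. The base R sketch below is only an analogy, not the Arrow reader: it assumes a temporary file holding the {{carat}} and {{cut}} values shown above, and uses {{read.csv()}} to show how treating the first line as a header drops one row.

```r
# Illustrative stand-in for diamonds_without_header.csv (two columns only).
diamonds_path <- tempfile(fileext = ".csv")
writeLines(
  c("0.23,Ideal", "0.21,Premium", "0.23,Good",
    "0.29,Premium", "0.31,Good", "0.24,Very Good"),
  diamonds_path
)

# First line interpreted as column names: one data row is lost.
as_header <- read.csv(diamonds_path, header = TRUE)

# First line interpreted as data: all six rows survive.
as_data <- read.csv(diamonds_path, header = FALSE)

print(nrow(as_header))
print(nrow(as_data))
```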
Now we use a simple dataset where two of the columns in the first row have the same value, 0.0.
{code:java}
randomDF <- tibble::tibble(
A=c(0.0, 2.3, 5.1)
, B=c('a', 'b', 'a')
, C=c(0.0, 3.1, 4.5)
)
readr::write_csv(randomDF, file='random_with_header.csv', col_names=TRUE)
readr::write_csv(randomDF, file='random_without_header.csv', col_names=FALSE)
random_schema <- schema(
A=float32()
, B=string()
, C=float32()
)
random_with_headers <- open_dataset('random_with_header.csv', schema=random_schema, format='csv')
random_without_headers <- open_dataset('random_without_header.csv', schema=random_schema, format='csv')
# gives a tibble with the proper values
random_with_headers %>% collect()
# A tibble: 3 x 3
A B C
<dbl> <chr> <dbl>
1 0 a 0
2 2.30 b 3.10
3 5.10 a 4.5
# results in an error
random_without_headers %>% collect()
Error: Invalid: Could not open CSV input source '/home/jared.lander@wabisabi.isso.net/projects/lethaldrifter/ld_etl/without_header.csv': Invalid: CSV file contained multiple columns named 0
{code}
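The "multiple columns named 0" message suggests that the values in the first line are being used as column names. A minimal base R sketch of that check follows. It assumes the file's first line is {{0,a,0}}, which is what the error message implies, and it writes a hypothetical temporary file rather than reading the real one.

```r
# Illustrative stand-in for random_without_header.csv.
random_path <- tempfile(fileext = ".csv")
writeLines(c("0,a,0", "2.3,b,3.1", "5.1,a,4.5"), random_path)

# Split the first line into fields, as a reader would when looking for names.
first_row <- strsplit(readLines(random_path, n = 1), ",", fixed = TRUE)[[1]]

# Any value that appears more than once would be a duplicated column name.
dup_names <- unique(first_row[duplicated(first_row)])

print(first_row)
print(dup_names)
```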
Interestingly, {{read_csv_arrow()}} has the opposite problem. Providing the schema works for CSVs without headers but fails for CSVs with headers, even though the help file says that providing a schema satisfies both {{col_names}} and {{col_types}}.
{code:java}
diamonds_read_with_header <- read_csv_arrow('diamonds_with_header.csv', schema=diamond_schema)
Error: Invalid: In CSV column #0: CSV conversion error to float: invalid value 'carat'
diamonds_read_without_header <- read_csv_arrow('diamonds_without_header.csv', schema=diamond_schema)
# reads normally
random_read_with_header <- read_csv_arrow('random_with_header.csv', schema=random_schema)
Error: Invalid: In CSV column #0: CSV conversion error to float: invalid value 'A'
random_read_without_header <- read_csv_arrow('random_without_header.csv', schema=random_schema)
# reads normally{code}
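One possible workaround for the with-header case may be to skip the header row explicitly. The sketch below assumes that the {{skip}} argument of {{read_csv_arrow()}} combines with {{schema}} so that the header line is dropped before type conversion. The temporary file and its contents are illustrative stand-ins for {{random_with_header.csv}}.

```r
library(arrow)

# Illustrative stand-in for random_with_header.csv.
header_path <- tempfile(fileext = ".csv")
writeLines(c("A,B,C", "0,a,0", "2.3,b,3.1", "5.1,a,4.5"), header_path)

workaround_schema <- schema(
  A = float32(),
  B = string(),
  C = float32()
)

# Assumption: skip = 1 drops the header line, so the schema supplies
# the names and types and every remaining line is parsed as data.
random_skipped <- read_csv_arrow(header_path, schema = workaround_schema, skip = 1)

print(random_skipped)
```

This does not address {{open_dataset()}}, which would still need a way to be told that the file has no header row.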