Posted to github@arrow.apache.org by "thisisnic (via GitHub)" <gi...@apache.org> on 2023/04/11 10:44:19 UTC
[GitHub] [arrow] thisisnic commented on issue #34965: [R] Add an argument to `open_csv_dataset()` to repair duplicated column names or ignore them?
thisisnic commented on issue #34965:
URL: https://github.com/apache/arrow/issues/34965#issuecomment-1503095194
Thanks for reporting this @etiennebacher! I can confirm that this is reproducible on the dev version of Arrow.
You're not missing anything obvious; as far as I know, Arrow Dataset objects don't allow duplicated column names. That error message isn't very helpful, so we should probably improve it and/or add code that repairs the names.
As a temporary workaround, you could manually supply a schema to the data with the corrected column names. I've added a brief example below; let me know if this works for your specific case. If it's still tricky, there'll be other workarounds we can try.
``` r
library(arrow)
file_location <- tempfile(fileext = ".csv")
test <- data.frame(x = 1, x = 2, check.names = FALSE)
write.csv(test, file_location, row.names = FALSE)
# works fine with readr
readr::read_csv(file_location)
#> New names:
#> • `x` -> `x...1`
#> • `x` -> `x...2`
#> Rows: 1 Columns: 2
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ","
#> dbl (2): x...1, x...2
#>
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> # A tibble: 1 × 2
#>   x...1 x...2
#>   <dbl> <dbl>
#> 1     1     2
# read in the file as an Arrow Table
# (named `tbl` rather than `file` to avoid masking base::file())
tbl <- read_csv_arrow(file_location, as_data_frame = FALSE)
# extract the schema from the table
my_schema <- tbl$schema
# we can see the duplicated names here
my_schema
#> Schema
#> x: int64
#> x: int64
# update the second field in the schema to be called "y" instead
my_schema[[2]] <- field("y", int64())
# open the dataset, specifying the new schema
# because the schema now supplies the column names, we have to include
# "skip = 1" so the header row of the file isn't read as data
ds <- arrow::open_csv_dataset(file_location, schema = my_schema, skip = 1)
dplyr::collect(ds)
#> # A tibble: 1 × 2
#>       x     y
#>   <int> <int>
#> 1     1     2
```
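If the file has more than a couple of duplicated columns, renaming fields one at a time gets tedious, and the same idea could probably be generalised along the following lines. This is only a sketch under some assumptions: it uses base R's `make.unique()` to generate the replacement names (so the second `x` becomes `x.1`, not readr's `x...2`), and the names `inferred`, `new_names`, `types` and `fixed_schema` are made up for illustration. It is not an existing `open_csv_dataset()` option.

```r
library(arrow)

# Recreate a CSV with duplicated column names
file_location <- tempfile(fileext = ".csv")
test <- data.frame(x = 1, x = 2, check.names = FALSE)
write.csv(test, file_location, row.names = FALSE)

# Read once as a Table to pick up the inferred column types
inferred <- read_csv_arrow(file_location, as_data_frame = FALSE)$schema

# Make the names unique with base R: c("x", "x") becomes c("x", "x.1")
new_names <- make.unique(names(inferred))

# Pair each new name with its inferred type and rebuild the schema
types <- lapply(inferred$fields, function(f) f$type)
fixed_schema <- do.call(schema, stats::setNames(types, new_names))

# skip = 1 drops the header row, since the schema supplies the names
ds <- open_csv_dataset(file_location, schema = fixed_schema, skip = 1)
names(ds)
```

Any other renaming rule would work in place of `make.unique()`, as long as it returns one unique name per column in the original order.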