Posted to github@arrow.apache.org by "thisisnic (via GitHub)" <gi...@apache.org> on 2023/04/11 10:44:19 UTC

[GitHub] [arrow] thisisnic commented on issue #34965: [R] Add an argument to `open_csv_dataset()` to repair duplicated column names or ignore them?

thisisnic commented on issue #34965:
URL: https://github.com/apache/arrow/issues/34965#issuecomment-1503095194

   Thanks for reporting this @etiennebacher!  I can confirm that this is reproducible on the dev version of Arrow.
   You're not missing anything obvious; as far as I know, Arrow Dataset objects don't allow duplicated column names.  The error message isn't very helpful, so we should probably improve it and/or add code that repairs the duplicated names.
   
   As a temporary workaround, you could manually supply a schema with corrected column names when opening the dataset.  I've added a brief example below; let me know if this works for your specific case.  If it's still tricky, there are other workarounds we can try.
   
   ``` r
   library(arrow)
   
   file_location <- tempfile(fileext = ".csv")
   
   test <- data.frame(x = 1, x = 2, check.names = FALSE)
   
   write.csv(test, file_location, row.names = FALSE)
   
   # works fine with readr
   readr::read_csv(file_location)
   #> New names:
   #> • `x` -> `x...1`
   #> • `x` -> `x...2`
   #> Rows: 1 Columns: 2
   #> ── Column specification ────────────────────────────────────────────────────────
   #> Delimiter: ","
   #> dbl (2): x...1, x...2
   #> 
   #> ℹ Use `spec()` to retrieve the full column specification for this data.
   #> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
   #> # A tibble: 1 × 2
   #>   x...1 x...2
   #>   <dbl> <dbl>
   #> 1     1     2
   
   # read in the file as an Arrow Table
   # (named `tab` rather than `file` to avoid masking base::file())
   tab <- read_csv_arrow(file_location, as_data_frame = FALSE)
   
   # extract the schema from the table
   my_schema <- tab$schema
   
   # we can see the duplicated names here
   my_schema
   #> Schema
   #> x: int64
   #> x: int64
   
   # update the second field in the schema to be called "y" instead
   my_schema[[2]] <- field("y", int64())
   
   # open the dataset, specifying the new schema
   # skip = 1 skips the header row, because the column names now come from the schema
   ds <- arrow::open_csv_dataset(file_location, schema = my_schema, skip = 1)
   dplyr::collect(ds)
   #> # A tibble: 1 × 2
   #>       x     y
   #>   <int> <int>
   #> 1     1     2
   ```
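
   If a file has many duplicated names, renaming fields one at a time gets tedious.  Below is a minimal sketch of the same workaround that deduplicates every name at once with base R's `make.unique()`.  The `_` separator and the variable names (`tab`, `new_names`, `new_schema`) are illustrative choices, not something Arrow prescribes, and the sketch assumes `schema()` accepts name/type pairs passed through `do.call()`.

   ``` r
   library(arrow)

   file_location <- tempfile(fileext = ".csv")
   test <- data.frame(x = 1, x = 2, check.names = FALSE)
   write.csv(test, file_location, row.names = FALSE)

   # read the file once as an Arrow Table to get the inferred schema
   tab <- read_csv_arrow(file_location, as_data_frame = FALSE)
   old_schema <- tab$schema

   # make every column name unique; base R appends a suffix to repeats
   new_names <- make.unique(names(old_schema), sep = "_")
   new_names
   #> [1] "x"   "x_1"

   # keep the inferred types, pairing each one with its new name
   types <- lapply(old_schema$fields, function(f) f$type)
   new_schema <- do.call(schema, setNames(types, new_names))

   # open the dataset with the repaired schema; skip = 1 skips the header row
   ds <- open_csv_dataset(file_location, schema = new_schema, skip = 1)
   result <- dplyr::collect(ds)
   ```

   Because the types are taken from the inferred schema, only the names change; swap in a different renaming function if you would rather have readr-style `x...1`, `x...2` names.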

