Posted to github@arrow.apache.org by "etiennebacher (via GitHub)" <gi...@apache.org> on 2023/04/11 12:32:10 UTC

[GitHub] [arrow] etiennebacher commented on issue #34965: [R] Add an argument to `open_csv_dataset()` to repair duplicated column names or ignore them?

etiennebacher commented on issue #34965:
URL: https://github.com/apache/arrow/issues/34965#issuecomment-1503243275

   Thank you for your answer @thisisnic. The workaround you provided works in this very simple case because there are only 2 columns, but in my scenario I have tens or hundreds of them. I improved it a bit to detect the duplicated names, repair them by adding a random suffix, and plug them back in:
   
   ``` r
   library(arrow)
   #> 
   #> Attaching package: 'arrow'
   #> The following object is masked from 'package:utils':
   #> 
   #>     timestamp
   packageVersion("arrow")
   #> [1] '11.0.0.3'
   
   file_location <- tempfile(fileext = ".csv")
   
   test <- data.frame(x = 1, x = 2, check.names = FALSE)
   write.csv(test, file_location)
   
   file <- read_csv_arrow(file_location, as_data_frame = FALSE)
   
   # extract the schema from the table
   my_schema <- file$schema
   
   # we can see the duplicated names here
   dupes <- which(duplicated(names(my_schema)))
   
   for (i in dupes) {
     
     # get original variable name and add a random suffix (so that the new name
     # is not a duplicate of another one)
     orig <- names(my_schema)[i]
     set.seed(i)
     suffix <- paste(sample(letters, 8), collapse = "")
     
     new_var <- paste0(orig, "_", suffix)
     
     # get an unevaluated expression for the field's type (eval'd below)
     orig_field <- my_schema$fields[[i]]$type$code()
     
     # update the variable
     my_schema[[i]] <- field(new_var, eval(orig_field))
     
     cat(paste("Old variable name:", orig, "\nNew variable name:", new_var, "\n\n"))
     
   }
   #> Old variable name: x 
   #> New variable name: x_elgdhkvj
   
   # open the dataset, specifying the new schema
   # we have to include "skip" to skip the first row of the file
   ds <- arrow::open_csv_dataset(file_location, schema = my_schema, skip = 1)
   dplyr::collect(ds)
   #> # A tibble: 1 × 3
   #>      ``     x x_elgdhkvj
   #>   <int> <int>      <int>
   #> 1     1     1          2
   ```
   
   (Note that I didn't check that this works with more than 2 duplicated names.)
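   For the multiple-duplicates case, one option might be base R's `make.unique()`, which deterministically repairs any number of duplicated names (no random suffix or seed needed). A minimal sketch on a plain character vector:
   
   ``` r
   # make.unique() leaves the first occurrence alone and numbers the rest,
   # so repeated names stay distinct however many times they appear
   nms <- c("x", "x", "y", "x")
   make.unique(nms, sep = "_")
   #> [1] "x"   "x_1" "y"   "x_2"
   ```
   
   The repaired vector could then be fed into the schema-rewriting loop above in place of the random suffixes.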
   
   Also, while this workaround is fast for small files, the initial `read_csv_arrow()` call takes some time, since it reads the whole file just to recover the schema. Nothing dramatic, but across dozens of files the delays pile up. Maybe there's a faster way to do this?
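   One possibly cheaper approach (untested sketch, and it assumes `open_csv_dataset()` honours `col_names` and `skip` the same way `read_csv_arrow()` does): read only the header line of the file, repair the names there, and let the dataset infer the column types itself instead of materialising a schema from a full read:
   
   ``` r
   # build the same 2-duplicate example file as above
   file_location <- tempfile(fileext = ".csv")
   write.csv(data.frame(x = 1, x = 2, check.names = FALSE), file_location)
   
   # read just the first line, not the whole file
   header <- strsplit(readLines(file_location, n = 1), ",", fixed = TRUE)[[1]]
   header <- gsub('"', "", header, fixed = TRUE)  # strip write.csv()'s quoting
   new_names <- make.unique(header, sep = "_")
   new_names
   #> [1] ""    "x"   "x_1"
   
   # hypothetical: pass the repaired names and skip the original header row
   # ds <- arrow::open_csv_dataset(file_location, col_names = new_names, skip = 1)
   ```
   
   This parses the header naively on commas, so it would need a proper CSV parser for quoted or comma-containing column names, but it avoids reading the data twice.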
   

