Posted to issues@arrow.apache.org by "etiennebacher (via GitHub)" <gi...@apache.org> on 2023/04/07 13:33:25 UTC

[GitHub] [arrow] etiennebacher opened a new issue, #34965: Add an argument to `open_csv_dataset()` to repair duplicated column names or ignore them?

etiennebacher opened a new issue, #34965:
URL: https://github.com/apache/arrow/issues/34965

   ### Describe the enhancement requested
   
   Hello, I'm just getting started with `arrow` so I hope this request makes sense.
   
   I have a list of CSV files that have some duplicated column names. When I use `readr::read_csv()`, it automatically "repairs" those names to make them unique but `arrow::open_csv_dataset()` errors because of those duplicated column names:
   
   ``` r
   packageVersion("arrow")
   #> [1] '11.0.0.3'
   
   file_location <- tempfile(fileext = ".csv")
   
   test <- data.frame(x = 1, x = 2, check.names = FALSE)
   write.csv(test, file_location)
   
   readr::read_csv(file_location)
   #> New names:
   #> • `` -> `...1`
   #> • `x` -> `x...2`
   #> • `x` -> `x...3`
   #> Rows: 1 Columns: 3
   #> ── Column specification ────────────────────────────────────────────────────────
   #> Delimiter: ","
   #> dbl (3): ...1, x...2, x...3
   #> 
   #> ℹ Use `spec()` to retrieve the full column specification for this data.
   #> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
   #> # A tibble: 1 × 3
   #>    ...1 x...2 x...3
   #>   <dbl> <dbl> <dbl>
   #> 1     1     1     2
   
   arrow::open_csv_dataset(file_location)
   #> Error in `open_dataset()`:
   #> ! Invalid: Error creating dataset. Could not read schema from 'C:/Users/etienne/AppData/Local/Temp/RtmpYLHyDe/file3c7c3a407a21.csv': Could not open CSV input source 'C:/Users/etienne/AppData/Local/Temp/RtmpYLHyDe/file3c7c3a407a21.csv': Invalid: CSV file contained multiple columns named x. Is this a 'csv' file?
   #> Backtrace:
   #>     ▆
   #>  1. └─arrow (local) `<fn>`(sources = file_location, delim = ",")
   #>  2.   └─arrow::open_dataset(...)
   #>  3.     └─base::tryCatch(...)
   #>  4.       └─base (local) tryCatchList(expr, classes, parentenv, handlers)
   #>  5.         └─base (local) tryCatchOne(expr, names, parentenv, handlers[[1L]])
   #>  6.           └─value[[3L]](cond)
   #>  7.             └─arrow:::augment_io_error_msg(e, call, format = format)
   #>  8.               └─rlang::abort(msg, call = call)
   ```
   
   I could ignore those columns with `col_select = -c(<colnames to ignore>)`. However, those columns are not necessarily present in all CSV files, and if they're absent then `open_csv_dataset()` also errors.
   
   Would it be possible to add an argument to repair the column names or to ignore them (i.e. not load them)? Or am I missing something obvious?
   
   Thanks for your work!
   
   
   ### Component(s)
   
   R


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] thisisnic commented on issue #34965: [R] Add an argument to `open_csv_dataset()` to repair duplicated column names or ignore them?

Posted by "thisisnic (via GitHub)" <gi...@apache.org>.
thisisnic commented on issue #34965:
URL: https://github.com/apache/arrow/issues/34965#issuecomment-1503354185

   In terms of quicker workarounds, it's hard to say without more context.  If you're expecting random column names to be duplicated, are any of the column names meaningful at all? If not, you could just autogenerate them all, e.g.:
   `arrow::open_csv_dataset(file_location, read_options = list(autogenerate_column_names = TRUE))`
   




[GitHub] [arrow] thisisnic commented on issue #34965: [R] Add an argument to `open_csv_dataset()` to repair duplicated column names or ignore them?

Posted by "thisisnic (via GitHub)" <gi...@apache.org>.
thisisnic commented on issue #34965:
URL: https://github.com/apache/arrow/issues/34965#issuecomment-1503095194

   Thanks for reporting this @etiennebacher!  I can confirm that this is reproducible on the dev version of Arrow.
   You're not missing something obvious; I believe Arrow Dataset objects don't allow duplicated column names.  That error message isn't the most helpful, so we could probably do with improving it and/or adding code which fixes this.
   
   As a temporary workaround, you could manually supply a schema to the data with the corrected column names.  I've added a brief example below; let me know if this works for your specific case.  If it's still tricky, there'll be other workarounds we can try.
   
   ``` r
   library(arrow)
   
   file_location <- tempfile(fileext = ".csv")
   
   test <- data.frame(x = 1, x = 2, check.names = FALSE)
   
   write.csv(test, file_location, row.names = FALSE)
   
   # works fine with readr
   readr::read_csv(file_location)
   #> New names:
   #> • `x` -> `x...1`
   #> • `x` -> `x...2`
   #> Rows: 1 Columns: 2
   #> ── Column specification ────────────────────────────────────────────────────────
   #> Delimiter: ","
   #> dbl (2): x...1, x...2
   #> 
   #> ℹ Use `spec()` to retrieve the full column specification for this data.
   #> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
   #> # A tibble: 1 × 2
   #>   x...1 x...2
   #>   <dbl> <dbl>
   #> 1     1     2
   
   # read in the file as an Arrow Table
   file <- read_csv_arrow(file_location, as_data_frame = FALSE)
   
   # extract the schema from the table
   my_schema <- file$schema
   
   # we can see the duplicated names here
   my_schema
   #> Schema
   #> x: int64
   #> x: int64
   
   # update the second field in the schema to be called "y" instead
   my_schema[[2]] <- field("y", int64())
   
   # open the dataset, specifying the new schema
   # we have to include "skip" to skip the first row of the file
   ds <- arrow::open_csv_dataset(file_location, schema = my_schema, skip = 1)
   dplyr::collect(ds)
   #> # A tibble: 1 × 2
   #>       x     y
   #>   <int> <int>
   #> 1     1     2
   ```




[GitHub] [arrow] etiennebacher commented on issue #34965: [R] Add an argument to `open_csv_dataset()` to repair duplicated column names or ignore them?

Posted by "etiennebacher (via GitHub)" <gi...@apache.org>.
etiennebacher commented on issue #34965:
URL: https://github.com/apache/arrow/issues/34965#issuecomment-1503243275

   Thank you for your answer @thisisnic. The workaround you provided works in this very simple case because there are only 2 columns, but I have tens or hundreds of them in my scenario. I improved it a bit to detect the duplicated names, repair them by adding a random suffix, and plug them back in:
   
   ``` r
   library(arrow)
   #> 
   #> Attaching package: 'arrow'
   #> The following object is masked from 'package:utils':
   #> 
   #>     timestamp
   packageVersion("arrow")
   #> [1] '11.0.0.3'
   
   file_location <- tempfile(fileext = ".csv")
   
   test <- data.frame(x = 1, x = 2, check.names = FALSE)
   write.csv(test, file_location)
   
   file <- read_csv_arrow(file_location, as_data_frame = FALSE)
   
   # extract the schema from the table
   my_schema <- file$schema
   
   # we can see the duplicated names here
   dupes <- which(duplicated(names(my_schema)))
   
   for (i in dupes) {
     
     # get original variable name and add a random suffix (so that the new name
     # is not a duplicate of another one)
     orig <- names(my_schema)[i]
     set.seed(i)
     suffix <- paste(sample(letters, 8), collapse = "")
     
     new_var <- paste0(orig, "_", suffix)
     
     # get the variable type
     orig_field <- my_schema$fields[[i]]$type$code()
     
     # update the variable
     my_schema[[i]] <- field(new_var, eval(orig_field))
     
     cat(paste("Old variable name:", orig, "\nNew variable name:", new_var, "\n\n"))
     
   }
   #> Old variable name: x 
   #> New variable name: x_elgdhkvj
   
   # open the dataset, specifying the new schema
   # we have to include "skip" to skip the first row of the file
   ds <- arrow::open_csv_dataset(file_location, schema = my_schema, skip = 1)
   dplyr::collect(ds)
   #> # A tibble: 1 × 3
   #>      ``     x x_elgdhkvj
   #>   <int> <int>      <int>
   #> 1     1     1          2
   ```
   
   (Note that I didn't check that this worked with more than 2 duplicated names.)
   
   Also, while this workaround is fast for small files, the initial `read_csv_arrow()` call takes some time. Nothing crazy, but extended to dozens of files, this can pile up and lead to a significant delay. Maybe there's a faster way to do this?
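   
   For what it's worth, a more compact variant of my loop (a sketch; `repair_schema_names()` is a hypothetical helper built on base R's `make.unique()` rather than random suffixes) could be:
   
   ``` r
   library(arrow)
   
   # Sketch: rebuild a schema with de-duplicated field names.
   # make.unique() appends "_1", "_2", ... to repeated names.
   repair_schema_names <- function(sch) {
     fixed <- make.unique(names(sch), sep = "_")
     fields <- lapply(seq_along(fixed), function(i) field(fixed[i], sch[[i]]$type))
     do.call(schema, fields)
   }
   
   my_schema <- repair_schema_names(file$schema)
   ```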
   




[GitHub] [arrow] thisisnic commented on issue #34965: [R] Add an argument to `open_csv_dataset()` to repair duplicated column names or ignore them?

Posted by "thisisnic (via GitHub)" <gi...@apache.org>.
thisisnic commented on issue #34965:
URL: https://github.com/apache/arrow/issues/34965#issuecomment-1505259909

   Found an existing ticket open relating to this: [[C++] Decide on duplicate column handling in scanner, add more tests](https://github.com/apache/arrow/issues/33552#top)




[GitHub] [arrow] etiennebacher commented on issue #34965: [R] Add an argument to `open_csv_dataset()` to repair duplicated column names or ignore them?

Posted by "etiennebacher (via GitHub)" <gi...@apache.org>.
etiennebacher commented on issue #34965:
URL: https://github.com/apache/arrow/issues/34965#issuecomment-1505196369

   Yes, the column names are useful. To be a bit more specific, this is census data that appears to have one or two duplicated columns. The catch is that these duplicated columns are not always in the same position, so a hardcoded index won't help here. Using `autogenerate_column_names = TRUE` is not an option either, since I would lose all information about column names.
   
   I tried to make a small example with 200 vars and 1,000,000 rows, but I can't reproduce the time `read_csv_arrow()` took in my "real" example.
   
   ``` r
   library(arrow)
   #> 
   #> Attaching package: 'arrow'
   #> The following object is masked from 'package:utils':
   #> 
   #>     timestamp
   library(tictoc)
   packageVersion("arrow")
   #> [1] '11.0.0.3'
   
   file_location <- tempfile(fileext = ".csv")
   
   # make a fake "big" dataset
   tmp <- list()
   for (i in 1:200) {
     set.seed(i)
     tmp[[paste0("var_", i)]] <- sample(1:100, 1e6, TRUE)
   }
   test <- list2DF(tmp)
   
   # make a duplicated column name
   names(test)[62] <- "var_1"
   
   readr::write_csv(test, file_location)
   
   tictoc::tic()
   file <- read_csv_arrow(file_location, as_data_frame = FALSE)
   tictoc::toc()
   #> 11.82 sec elapsed
   
   # extract the schema from the table
   my_schema <- file$schema
   
   # we can see the duplicated names here
   dupes <- which(duplicated(names(my_schema)))
   
   for (i in dupes) {
     
     # get original variable name and add a random suffix (so that the new name
     # is not a duplicate of another one)
     orig <- names(my_schema)[i]
     set.seed(i)
     suffix <- paste(sample(letters, 8), collapse = "")
     
     new_var <- paste0(orig, "_", suffix)
     
     # get the variable type
     orig_field <- my_schema$fields[[i]]$type$code()
     
     # update the variable
     my_schema[[i]] <- field(new_var, eval(orig_field))
     
     cat(paste("Old variable name:", orig, "\nNew variable name:", new_var, "\n\n"))
     
   }
   #> Old variable name: var_1 
   #> New variable name: var_1_vkeoxuwd
   
   # open the dataset, specifying the new schema
   # we have to include "skip" to skip the first row of the file
   ds <- arrow::open_csv_dataset(file_location, schema = my_schema, skip = 1)
   out <- dplyr::collect(ds)
   ```
   
   This is not an urgent issue for me, but I think having a way to automatically repair duplicated column names would be useful. Is it feasible to implement? (You proposed a better error message and a workaround, so I'd just like to clarify whether this could become a feature later.)
   




[GitHub] [arrow] thisisnic commented on issue #34965: [R] Add an argument to `open_csv_dataset()` to repair duplicated column names or ignore them?

Posted by "thisisnic (via GitHub)" <gi...@apache.org>.
thisisnic commented on issue #34965:
URL: https://github.com/apache/arrow/issues/34965#issuecomment-1505254724

   > Is it feasible to implement? 
   
   We can absolutely open up a C++ ticket and see if anyone has time to work on it. 
   
   Another thing which is tangentially related but may help is #35035, which would make it easier to rename all fields in a schema in one go.  It doesn't solve the problem, but it could lead to a faster workaround once you have the existing schema.
   
   A faster way of extracting the existing schema would be to use `readr::read_csv()`, which can limit the number of lines read and works really quickly, e.g.
   
   ``` r
   one_row <- readr::read_csv(file_location, n_max = 1) |>
     arrow::arrow_table()
   
   one_row$schema
   ```
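   
   Extending that idea, an end-to-end sketch might look like the following (note: readr's type inference from a single row may not match the types in the full file, so the derived schema is an assumption to verify against your data):
   
   ``` r
   library(arrow)
   
   # Read just the header plus one row to get names and types cheaply,
   # repair duplicated names, then open the full dataset with that schema.
   one_row <- readr::read_csv(file_location, n_max = 1, show_col_types = FALSE)
   names(one_row) <- make.unique(names(one_row), sep = "_")
   
   my_schema <- arrow_table(one_row)$schema
   ds <- open_csv_dataset(file_location, schema = my_schema, skip = 1)
   ```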
   
   

