Posted to jira@arrow.apache.org by "Martin du Toit (Jira)" <ji...@apache.org> on 2022/01/05 11:56:00 UTC
[jira] [Commented] (ARROW-15252) [R] Expose skip_rows_after in CSVReadOptions
[ https://issues.apache.org/jira/browse/ARROW-15252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17469241#comment-17469241 ]
Martin du Toit commented on ARROW-15252:
----------------------------------------
Hi [~thisisnic], thanks for getting back to me.
I also tried it with pyarrow, although Python is not my preferred language, and got the same error. Is this possible with pyarrow, or is the option not exposed there either?
> [R] Expose skip_rows_after in CSVReadOptions
> ---------------------------------------------
>
> Key: ARROW-15252
> URL: https://issues.apache.org/jira/browse/ARROW-15252
> Project: Apache Arrow
> Issue Type: Improvement
> Components: R
> Reporter: Martin du Toit
> Assignee: Nicola Crane
> Priority: Major
> Attachments: I2478172_Activity_20180830.csv
>
>
> I am not sure if this is a bug. I call open_dataset on a directory of CSV files that each have a header and a footer, and I specify the convert options below, including include_missing_columns. The same code works fine on files with no header and footer.
> {code:r}
> library(arrow)
> # Column names as given in the 2nd row of each file, i.e. they are known up front
> col_names <- c("col names specified as in 2nd row of file")
> skip <- 2
> file_path <- "path to directory holding various files"
> # schema_file <- created using arrow::schema
> # schema_df <- created using arrow::schema, but with extra columns for the .partition_cols
> conv_options <- CsvConvertOptions$create(
>   strings_can_be_null = TRUE,
>   include_missing_columns = TRUE,
>   include_columns = col_names
> )
> # readr_to_csv_read_options is an unexported helper, hence the triple colon
> read_options <- arrow:::readr_to_csv_read_options(skip, col_names)
> format <- arrow::FileFormat$create(format = "text", schema = schema_file, convert_options = conv_options, read_options = read_options)
> ds <- arrow::open_dataset(sources = file_path, schema = schema_df, partitioning = .partition_cols, format = format)
> {code}
> The dataset gets created, but any further operation on the dataset fails with:
> {code:r}
> Error: Invalid: CSV parse error: Row #7: Expected 41 columns, got 3: T,7,
> {code}
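Until a footer-skipping option is exposed, one possible workaround is to pre-clean each file before handing it to a CSV reader. The following is only a minimal sketch in Python using the standard library, not anything Arrow provides: the function name strip_header_and_footer, its parameters, and the sample data are all made up for illustration. It assumes that footer rows can be recognised purely by having a different number of fields than the data rows, as with the 3-field "T,7," row in the error above.

```python
import csv
import io


def strip_header_and_footer(text, skip_rows, expected_columns):
    """Return CSV text without the first `skip_rows` rows and without any
    later row whose field count differs from `expected_columns`.

    Hypothetical helper for illustration; not part of Arrow or pyarrow.
    """
    kept = io.StringIO()
    writer = csv.writer(kept, lineterminator="\n")
    reader = csv.reader(io.StringIO(text))
    for index, row in enumerate(reader):
        if index < skip_rows:
            continue  # header block that precedes the data
        if len(row) != expected_columns:
            continue  # footer/trailer row (or blank line) with the wrong width
        writer.writerow(row)
    return kept.getvalue()


# Made-up sample: two header rows, two data rows, one three-field footer.
sample = "H,20180830\na,b,c,d\n1,2,3,4\n5,6,7,8\nT,2,\n"
cleaned = strip_header_and_footer(sample, skip_rows=2, expected_columns=4)
print(cleaned)
```

The cleaned text could then be encoded to bytes, wrapped in io.BytesIO, and passed to pyarrow.csv.read_csv together with ReadOptions(column_names=...). Note that this filter also silently drops any genuinely malformed data row, so it is only reasonable when the footer is the sole source of width mismatches, and it works per file rather than through the dataset API.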
--
This message was sent by Atlassian Jira
(v8.20.1#820001)