Posted to jira@arrow.apache.org by "Nicola Crane (Jira)" <ji...@apache.org> on 2022/01/05 11:00:02 UTC

[jira] [Commented] (ARROW-15252) [R] open_dataset - csv file with header and footer

    [ https://issues.apache.org/jira/browse/ARROW-15252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17469199#comment-17469199 ] 

Nicola Crane commented on ARROW-15252:
--------------------------------------

Thanks for opening this issue, [~martindut]. I think the problem here is that the CSV reader isn't expecting the footer row and treats it as data, so you get that error because the footer has fewer columns than the actual data (3 instead of the expected 41). The C++ code includes the ability to skip footer rows, but this isn't exposed at the R level (yet).
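
In the meantime, one possible workaround is to strip the non-data lines before handing the files to arrow. The following is only a minimal sketch in base R, not an arrow feature: the helper name strip_header_footer, its n_header and n_footer arguments, and the sample file contents are all made up for illustration, and it assumes the header and footer each occupy a known, fixed number of lines.

{code:r}
# Hypothetical helper: drop leading and trailing non-data lines from one file.
# n_header and n_footer are assumptions about the file layout, not arrow options.
strip_header_footer <- function(in_path, out_path, n_header = 1L, n_footer = 1L) {
  lines <- readLines(in_path, warn = FALSE)
  n <- length(lines)
  if (n < n_header + n_footer) {
    stop("File has fewer lines than n_header + n_footer: ", in_path)
  }
  # Keep only the lines between the header and the footer.
  keep <- lines[seq_len(n - n_header - n_footer) + n_header]
  writeLines(keep, out_path)
  invisible(out_path)
}

# Small made-up file: one header line, the column-name row, two data rows,
# and a "T,<count>," style footer like the one in the error message.
raw <- tempfile(fileext = ".csv")
writeLines(c("H,20180830,", "a,b,c,d", "1,2,3,4", "5,6,7,8", "T,2,"), raw)

clean <- tempfile(fileext = ".csv")
strip_header_footer(raw, clean, n_header = 1L, n_footer = 1L)

# Base R reads the cleaned copy without a column-count mismatch.
df <- read.csv(clean)
{code}

The cleaned copies could be written to a separate directory and that directory passed to open_dataset. If the column names are supplied separately, as in the report, raising n_header by one would drop the column-name row as well.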

> [R] open_dataset - csv file with header and footer
> --------------------------------------------------
>
>                 Key: ARROW-15252
>                 URL: https://issues.apache.org/jira/browse/ARROW-15252
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: R
>            Reporter: Martin du Toit
>            Priority: Major
>         Attachments: I2478172_Activity_20180830.csv
>
>
> Not sure if this is a bug. I am calling open_dataset on a directory of csv files that each have a header and a footer, with convert options that set include_missing_columns, as shown below. The same code works fine on files with no header and footer.
> {code:r}
> col_names <- c("col names specified as in 2nd row of file") #ie colnames is known
> skip <- 2
> file_path <- "path to directory holding various files"
> #schema_file <- created using arrow::schema
> #schema_df<- created using arrow::schema but with extra columns for the .partition_cols
> conv_options <- CsvConvertOptions$create(strings_can_be_null = TRUE, include_missing_columns = TRUE, include_columns = col_names) 
> read_options <- arrow:::readr_to_csv_read_options(skip, col_names)
> format <- arrow::FileFormat$create(format = "text", schema = schema_file, convert_options = conv_options, read_options  = read_options)
> ds <- arrow::open_dataset(sources = file_path, schema = schema_df, partitioning = .partition_cols, format = format){code}
> The dataset gets created, but any further operation on the dataset fails with
> {code:r}
> Error: Invalid: CSV parse error: Row #7: Expected 41 columns, got 3: T,7,
> {code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)