Posted to jira@arrow.apache.org by "Nicola Crane (Jira)" <ji...@apache.org> on 2022/03/22 13:44:00 UTC
[jira] [Updated] (ARROW-15992) [R] csv file encoding working for one file, but not a folder of files
[ https://issues.apache.org/jira/browse/ARROW-15992?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Nicola Crane updated ARROW-15992:
---------------------------------
Issue Type: Bug (was: New Feature)
> [R] csv file encoding working for one file, but not a folder of files
> ---------------------------------------------------------------------
>
> Key: ARROW-15992
> URL: https://issues.apache.org/jira/browse/ARROW-15992
> Project: Apache Arrow
> Issue Type: Bug
> Components: R
> Reporter: Gregoire Leleu
> Priority: Major
> Attachments: Test1.txt
>
>
> The encoding options are passed through when a single file is read with read_delim_arrow, but not when a folder is opened with open_dataset.
> read_delim_arrow creates a reader using CsvTableReader$create (which is the path covered by the package's tests).
> open_dataset creates a factory instead, and I'm unable to follow what happens once $Finish() is called.
>
> Also, the documentation (the "CsvReadOptions" page) lists the "encoding" option under "CsvConvertOptions$create()" instead of "CsvReadOptions$create()".
>
> {code:r}
> library(dplyr)
> library(arrow)
> # Opens one file just fine:
> one_file <- arrow::read_delim_arrow(
>   "test/Test1.txt",
>   as_data_frame = FALSE,
>   delim = ";",
>   read_options = CsvReadOptions$create(encoding = "ISO-8859-1")
> )
> collect(one_file)
>
> # Can't properly open the folder that contains "Test1.txt": Column2 ends up typed as binary
> one_folder <- arrow::open_dataset(
>   "test",
>   delim = ";",
>   read_options = CsvReadOptions$create(encoding = "ISO-8859-1")
> )
> collect(one_folder)
>
> # Even when specifying the schema
> one_folder_w_schema <- arrow::open_dataset(
>   "test",
>   schema = Schema$create(Column1 = string(), Column2 = string()),
>   format = FileFormat$create(
>     "text",
>     skip_rows = 1L,
>     delimiter = ";",
>     column_names = c("Column1", "Column2"),
>     read_options = CsvReadOptions$create(encoding = "ISO-8859-1")
>   )
> )
> collect(one_folder_w_schema)
> {code}
>
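Since the report says the single-file path applies the encoding, one possible stopgap is to read each file with read_delim_arrow and bind the results, bypassing open_dataset entirely. This is only a sketch, not a fix for the dataset path. The scratch folder, the file names a.txt and b.txt, and the helpers write_latin1 and read_one are hypothetical and exist only to make the example self-contained; the delimiter and encoding mirror the report.

```r
library(arrow)

# Hypothetical demo data: a scratch folder holding two small ISO-8859-1 files,
# standing in for the reporter's "test" folder.
demo_dir <- file.path(tempdir(), "latin1_demo")
dir.create(demo_dir, showWarnings = FALSE)

write_latin1 <- function(path, text) {
  # Convert UTF-8 text to ISO-8859-1 and write the raw bytes unchanged.
  bytes <- iconv(text, from = "UTF-8", to = "ISO-8859-1", toRaw = TRUE)[[1]]
  writeBin(bytes, path)
}
write_latin1(file.path(demo_dir, "a.txt"), "Column1;Column2\n1;caf\u00e9\n")
write_latin1(file.path(demo_dir, "b.txt"), "Column1;Column2\n2;na\u00efve\n")

# Read one file through the single-file reader, which the report says
# applies the encoding option.
read_one <- function(path) {
  read_delim_arrow(
    path,
    delim = ";",
    as_data_frame = TRUE,
    read_options = CsvReadOptions$create(encoding = "ISO-8859-1")
  )
}

# Read every file in the folder and stack the resulting data frames.
files <- sort(list.files(demo_dir, pattern = "\\.txt$", full.names = TRUE))
combined <- do.call(rbind, lapply(files, read_one))
print(combined)
```

This loads everything into memory, so it gives up the lazy evaluation that open_dataset provides and is only reasonable for folders that fit in RAM.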
--
This message was sent by Atlassian Jira
(v8.20.1#820001)