Posted to jira@arrow.apache.org by "Nicola Crane (Jira)" <ji...@apache.org> on 2022/10/27 20:18:00 UTC

[jira] [Updated] (ARROW-15992) [R] csv file encoding working for one file, but not a folder of files

     [ https://issues.apache.org/jira/browse/ARROW-15992?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Nicola Crane updated ARROW-15992:
---------------------------------
        Parent: ARROW-18181
    Issue Type: Sub-task  (was: Bug)

> [R] csv file encoding working for one file, but not a folder of files
> ---------------------------------------------------------------------
>
>                 Key: ARROW-15992
>                 URL: https://issues.apache.org/jira/browse/ARROW-15992
>             Project: Apache Arrow
>          Issue Type: Sub-task
>          Components: R
>            Reporter: Gregoire Leleu
>            Priority: Major
>         Attachments: Test1.txt
>
>
> The encoding option is applied when a single file is read with read_delim_arrow, but it is ignored when a folder is opened with open_dataset.
> read_delim_arrow creates a reader using CsvTableReader$create (which is the path covered by the package's tests).
> open_dataset creates a factory instead, and I'm unable to follow what happens to the read options when $Finish() is called.
>  
> Also, the documentation ("CsvReadOptions" page) lists the "encoding" option under "CsvConvertOptions$create()" instead of "CsvReadOptions$create()".
>  
> {code:r}
> library(dplyr)
> library(arrow)
> # Opens one file just fine:
> one_file <- arrow::read_delim_arrow(
>   "test/Test1.txt", 
>   as_data_frame = FALSE,
>   delim = ";",
>   read_options = CsvReadOptions$create(encoding = "ISO-8859-1")
> )
> collect(one_file)
>  
> # Opening the folder that contains "Test1.txt" does not work properly: Column2 ends up typed as binary
> one_folder <- arrow::open_dataset(
>   "test", 
>   delim = ";",
>   read_options = CsvReadOptions$create(encoding = "ISO-8859-1")
> )
> collect(one_folder)
>  
> # The same happens even when the schema is specified
> one_folder_w_schema <- arrow::open_dataset(
>   "test", 
>   schema = Schema$create(Column1 = string(), Column2 = string()),
>   format = FileFormat$create(
>     "text",
>     skip_rows = 1L,
>     delimiter = ";",
>     column_names = c("Column1", "Column2"),
>     read_options = CsvReadOptions$create(encoding = "ISO-8859-1")
>   )
> )
> collect(one_folder_w_schema)
> {code}
>  
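
A possible workaround sketch, not part of the original report and not a fix for open_dataset: since the report states that the encoding is honoured when a single file is read, each file in the folder could be read on its own with read_delim_arrow and the results combined. The temporary directory, the sample row, and the read_one helper are hypothetical and exist only to make the example self-contained. Note that this loads all data into memory instead of creating a lazy dataset.

```r
library(arrow)
library(dplyr)

# Hypothetical sample data: one semicolon-delimited file encoded as ISO-8859-1
demo_dir <- file.path(tempdir(), "encoding_demo")
dir.create(demo_dir, showWarnings = FALSE)
latin1_bytes <- iconv(
  "Column1;Column2\n1;caf\u00e9\n",
  from = "UTF-8", to = "ISO-8859-1", toRaw = TRUE
)[[1]]
writeBin(latin1_bytes, file.path(demo_dir, "Test1.txt"))

# Hypothetical helper: read a single file with the encoding applied,
# which is the code path the report describes as working
read_one <- function(path) {
  read_delim_arrow(
    path,
    delim = ";",
    read_options = CsvReadOptions$create(encoding = "ISO-8859-1")
  )
}

# Read every file in the folder separately, then stack the data frames
files <- list.files(demo_dir, full.names = TRUE)
combined <- bind_rows(lapply(files, read_one))
```

For folders too large to hold in memory this approach would not be suitable; it only sidesteps the dataset factory path described above.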


