Posted to jira@arrow.apache.org by "Nicola Crane (Jira)" <ji...@apache.org> on 2022/03/22 11:04:00 UTC

[jira] [Commented] (ARROW-15992) [R] csv file encoding working for one file, but not a folder of files

    [ https://issues.apache.org/jira/browse/ARROW-15992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17510410#comment-17510410 ] 

Nicola Crane commented on ARROW-15992:
--------------------------------------

Thanks for reporting this, [~gregleleu]. I don't think this is currently supported. I've opened ticket ARROW-16000 to ask for it to be implemented in the C++ library; once that is done, we should be able to expose the functionality in R.
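
Until then, one possible workaround is to re-encode the files to UTF-8 before opening the folder. The snippet below is only a sketch: reencode_dir_to_utf8 is a hypothetical helper (not part of arrow) built on base R's iconv, and it assumes every file in the folder uses the same source encoding and is small enough to read into memory. I have not run it against the attached Test1.txt.

{code:r}
# Hypothetical helper (not part of arrow): copy every file in `src_dir`
# into `dest_dir`, converting its contents from `from` to UTF-8.
reencode_dir_to_utf8 <- function(src_dir, dest_dir, from = "ISO-8859-1") {
  dir.create(dest_dir, showWarnings = FALSE, recursive = TRUE)
  files <- list.files(src_dir, full.names = TRUE)
  files <- files[!dir.exists(files)]  # skip sub-directories
  for (f in files) {
    lines <- readLines(f, warn = FALSE)
    # iconv converts the underlying bytes from `from` to UTF-8
    utf8 <- iconv(lines, from = from, to = "UTF-8")
    con <- file(file.path(dest_dir, basename(f)), open = "wb")
    # useBytes = TRUE writes the UTF-8 bytes without re-encoding them
    writeLines(utf8, con, useBytes = TRUE)
    close(con)
  }
  invisible(dest_dir)
}

# Usage, assuming the "test" folder from the report exists and arrow is installed
if (dir.exists("test") && requireNamespace("arrow", quietly = TRUE)) {
  utf8_dir <- reencode_dir_to_utf8("test", file.path(tempdir(), "test_utf8"))
  one_folder <- arrow::open_dataset(utf8_dir, format = "text", delimiter = ";")
}
{code}

This writes a second copy of the data, so it is only practical when the folder is of moderate size.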

> [R] csv file encoding working for one file, but not a folder of files
> ---------------------------------------------------------------------
>
>                 Key: ARROW-15992
>                 URL: https://issues.apache.org/jira/browse/ARROW-15992
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: R
>            Reporter: Gregoire Leleu
>            Priority: Major
>         Attachments: Test1.txt
>
>
> The encoding options are passed when a single file is read with read_delim_arrow, but not when opening a folder with open_dataset.
> read_delim_arrow creates a reader using CsvTableReader$create (which is what is tested in the package's tests).
> open_dataset creates a factory and I'm unable to follow what happens when $Finish() is called.
>  
> Also, the documentation ("CsvReadOptions" page) lists the "encoding" option under "CsvConvertOptions$create()" instead of "CsvReadOptions$create()".
>  
> {code:r}
> library(dplyr)
> library(arrow)
> # Opens one file just fine:
> one_file <- arrow::read_delim_arrow(
>   "test/Test1.txt", 
>   as_data_frame = FALSE,
>   delim = ";",
>   read_options = CsvReadOptions$create(encoding = "ISO-8859-1")
> )
> collect(one_file)
>  
> # Can't open the folder that has "Test1.txt" properly, results in Column2 being typed as binary
> one_folder <- arrow::open_dataset(
>   "test", 
>   delim = ";",
>   read_options = CsvReadOptions$create(encoding = "ISO-8859-1")
> )
> collect(one_folder)
>  
> # Same result even when specifying the schema
> one_folder_w_schema <- arrow::open_dataset(
>   "test", 
>   schema = Schema$create(Column1 = string(), Column2 = string()),
>   format = FileFormat$create(
>     "text",
>     skip_rows = 1L,
>     delimiter = ";",
>     column_names = c("Column1", "Column2"),
>     read_options = CsvReadOptions$create(encoding = "ISO-8859-1")
>   )
> )
> collect(one_folder_w_schema)
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)