Posted to issues@arrow.apache.org by "David Wales (Jira)" <ji...@apache.org> on 2021/03/31 03:09:00 UTC

[jira] [Created] (ARROW-12162) [R] read_parquet returns Invalid UTF8 payload

David Wales created ARROW-12162:
-----------------------------------

             Summary: [R] read_parquet returns Invalid UTF8 payload
                 Key: ARROW-12162
                 URL: https://issues.apache.org/jira/browse/ARROW-12162
             Project: Apache Arrow
          Issue Type: Bug
          Components: R
    Affects Versions: 3.0.0
         Environment: Windows 10, R 4.0.3, arrow 3.0.0, dbplyr 2.0.0, dplyr 1.0.2
            Reporter: David Wales
         Attachments: bad_char.rds

h2. Background

I am using the R arrow library.

I am reading from an SQL Server database with the `latin1` encoding using `dbplyr` and saving the output as a parquet file:

{code:java}
library(arrow)
library(dplyr)
library(dbplyr)  # for in_schema()
# Assume `con` is a previously established connection to the database created with DBI::dbConnect
tbl(con, in_schema("dbo", "latin1_table")) %>%
  collect() %>%
  write_parquet("output.parquet")
{code}

However, when I try to read the file back, I get the error "Invalid UTF8 payload":

{code:java}
> read_parquet("output.parquet")
Error: Invalid: Invalid UTF8 payload
{code}

h2. Minimal Reproducible Example

I have isolated this issue to a minimal reproducible example.

If the database table contains the latin1 single quote character, writing the data to parquet and reading it back triggers the error.

I have attached a `.rds` file which contains an example tibble.

To reproduce, run the following:

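{code:java}
library(arrow)
library(dplyr)  # for %>%
# `data_dir` is the directory holding the attached bad_char.rds
readRDS(file.path(data_dir, "bad_char.rds")) %>%
  write_parquet(file.path(data_dir, "bad_char.parquet"))
read_parquet(file.path(data_dir, "bad_char.parquet"))
{code}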
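
For anyone without the attachment, the following is a minimal sketch of how one might check whether a string holds bytes that are not valid UTF-8, and re-encode it before writing. It assumes the offending character is stored as the single byte 0x92, which this report does not confirm. The names `x` and `y` are hypothetical, and only base R functions are used.

```r
# Hypothetical stand-in for the offending value. The byte 0x92 is an
# assumption; the report does not say which byte the table contains.
x <- "it\x92s"
Encoding(x) <- "latin1"

# validUTF8() inspects the raw bytes, regardless of the declared encoding.
# A lone 0x92 byte is not a valid UTF-8 sequence, so this is FALSE.
validUTF8(x)

# Re-encode from latin1 to UTF-8; the result is valid UTF-8.
y <- iconv(x, from = "latin1", to = "UTF-8")
validUTF8(y)
```

Applying the same check to each character column of the collected tibble before calling `write_parquet` may help narrow down which values are affected. Whether re-encoding is an acceptable workaround depends on the data.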



--
This message was sent by Atlassian Jira
(v8.3.4#803005)