Posted to jira@arrow.apache.org by "David Wales (Jira)" <ji...@apache.org> on 2021/03/31 03:11:00 UTC
[jira] [Updated] (ARROW-12162) [R] read_parquet returns Invalid UTF8 payload
[ https://issues.apache.org/jira/browse/ARROW-12162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
David Wales updated ARROW-12162:
--------------------------------
Description:
h2. Background
I am using the R arrow library.
I am reading from an SQL Server database with the `latin1` encoding using `dbplyr` and saving the output as a parquet file:
{code:java}
library(dplyr)   # tbl(), collect(), %>%
library(dbplyr)  # in_schema()
library(arrow)   # write_parquet()
# Assume `con` is a previously established connection to the database created with DBI::dbConnect
tbl(con, in_schema("dbo", "latin1_table")) %>%
  collect() %>%
  write_parquet("output.parquet")
{code}
However, when I try to read the file back, I get the error "Invalid UTF8 payload":
{code:java}
> read_parquet("output.parquet")
Error: Invalid: Invalid UTF8 payload
{code}
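To narrow down which columns are affected, a check along the following lines may help. This is only a sketch in base R: `find_invalid_utf8_columns` is a hypothetical helper (not part of arrow), and the example data frame is a stand-in built from the byte 0xE9, which is "é" in latin1 but not valid UTF-8 on its own. It assumes the error comes from character values whose bytes are not valid UTF-8, which this report does not confirm.

```r
# Hypothetical helper (not part of arrow): names of the character columns in
# `df` that contain at least one non-NA value that is not valid UTF-8.
find_invalid_utf8_columns <- function(df) {
  is_bad <- vapply(
    df,
    function(col) is.character(col) && any(!validUTF8(col[!is.na(col)])),
    logical(1)
  )
  names(df)[is_bad]
}

# Stand-in data: the last value of `bad` ends in the single latin1 byte 0xE9.
example <- data.frame(
  id = 1:2,
  ok = c("plain", "ascii"),
  bad = c("fine", rawToChar(as.raw(c(0x63, 0x61, 0x66, 0xe9)))),
  stringsAsFactors = FALSE
)
find_invalid_utf8_columns(example)
```

Running the same helper on the collected table (before `write_parquet`) would list the columns worth inspecting.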
h2. Minimal Reproducible Example
I have isolated this issue to a minimal reproducible example.
If the database table contains the latin1 single quote character, then it will trigger the error.
I have attached a `.rds` file which contains an example tibble.
To reproduce, run the following:
{code:java}
# Assume `data_dir` is the directory containing the attached bad_char.rds
readRDS(file.path(data_dir, "bad_char.rds")) %>%
  write_parquet(file.path(data_dir, "bad_char.parquet"))
read_parquet(file.path(data_dir, "bad_char.parquet"))
{code}
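A possible workaround, sketched under the assumption that the offending strings are latin1 bytes that were never converted to UTF-8, is to re-encode the character columns before writing. `reencode_to_utf8` is a hypothetical helper written for illustration, not an arrow function, and the `from = "latin1"` default is a guess: data that is really Windows-1252 may need a different source encoding name. The data frame below is a stand-in, not the attached tibble.

```r
# Hypothetical helper (not part of arrow or dplyr): re-encode every character
# column of `df` from `from` to UTF-8, touching only the non-NA values that
# are not already valid UTF-8.
reencode_to_utf8 <- function(df, from = "latin1") {
  for (nm in names(df)) {
    col <- df[[nm]]
    if (is.character(col)) {
      bad <- !is.na(col) & !validUTF8(col)
      col[bad] <- iconv(col[bad], from = from, to = "UTF-8")
      df[[nm]] <- col
    }
  }
  df
}

# Stand-in data: the third value ends in the single latin1 byte 0xE9 ("é").
raw_df <- data.frame(
  txt = c("plain", NA, rawToChar(as.raw(c(0x63, 0x61, 0x66, 0xe9)))),
  stringsAsFactors = FALSE
)
clean_df <- reencode_to_utf8(raw_df)

# TRUE when every non-NA value is now valid UTF-8.
all(validUTF8(clean_df$txt[!is.na(clean_df$txt)]))
```

The converted data frame could then be passed to `write_parquet` in place of the original. Whether the file then reads back cleanly has not been verified against the attached `bad_char.rds`.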
> [R] read_parquet returns Invalid UTF8 payload
> ---------------------------------------------
>
> Key: ARROW-12162
> URL: https://issues.apache.org/jira/browse/ARROW-12162
> Project: Apache Arrow
> Issue Type: Bug
> Components: R
> Affects Versions: 3.0.0
> Environment: Windows 10
> R 4.0.3
> arrow 3.0.0
> dbplyr 2.0.0
> dplyr 1.0.2
> Reporter: David Wales
> Priority: Major
> Attachments: bad_char.rds
--
This message was sent by Atlassian Jira
(v8.3.4#803005)