Posted to jira@arrow.apache.org by "Wes McKinney (Jira)" <ji...@apache.org> on 2020/06/17 20:32:00 UTC
[jira] [Commented] (ARROW-7018) [R] Non-UTF-8 data in Arrow <--> R conversion
[ https://issues.apache.org/jira/browse/ARROW-7018?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17138800#comment-17138800 ]
Wes McKinney commented on ARROW-7018:
-------------------------------------
Type::STRING must have UTF-8 data. Non-UTF-8 string data should be stored as BINARY. If you want to preserve the encoding, for informational purposes or otherwise, you can store it in the metadata of the corresponding Field of the Schema.
I just opened ARROW-9163 about adding a method to make it easier to validate whether a string array contains only valid UTF-8 values.
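
To illustrate the rule above, here is a minimal sketch in plain Python (standard library only, no Arrow calls). It shows why the reporter's Latin-1 bytes come back with a replacement character when read as UTF-8, and how a column could be classified as string or binary with the encoding kept alongside as metadata. The helper names is_valid_utf8 and to_column, the "string"/"binary" labels, and the "encoding" metadata key are all hypothetical and are not part of any Arrow API.

```python
def is_valid_utf8(raw: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def to_column(values, source_encoding):
    """Classify a list of byte strings (hypothetical helper).

    Returns (kind, payload, metadata). If every value is valid UTF-8 the
    column is treated as a string column; otherwise the raw bytes are kept
    as binary and the declared source encoding is recorded in metadata.
    """
    if all(is_valid_utf8(v) for v in values):
        return "string", [v.decode("utf-8") for v in values], {}
    return "binary", list(values), {"encoding": source_encoding}


word = "Veitingastaðir"
latin1_bytes = word.encode("latin-1")  # 'ð' is the single byte 0xF0
utf8_bytes = word.encode("utf-8")      # 'ð' is the two bytes 0xC3 0xB0

# Reading Latin-1 bytes as if they were UTF-8 reproduces the reported symptom:
# the lone 0xF0 byte is not valid UTF-8 and becomes U+FFFD.
garbled = latin1_bytes.decode("utf-8", errors="replace")

# Keeping the bytes as binary plus an encoding tag lets a reader recover the text.
kind, payload, meta = to_column([latin1_bytes], "latin-1")
recovered = payload[0].decode(meta["encoding"])
```

Note that a validity check like this only tells you the bytes are well-formed UTF-8. Pure ASCII data passes regardless of the declared encoding, so the check cannot identify which encoding the data originally used.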
> [R] Non-UTF-8 data in Arrow <--> R conversion
> ---------------------------------------------
>
> Key: ARROW-7018
> URL: https://issues.apache.org/jira/browse/ARROW-7018
> Project: Apache Arrow
> Issue Type: Bug
> Components: R
> Affects Versions: 0.15.0
> Environment: I'm running R on Windows 10
> Reporter: Vidar Ingason
> Assignee: Romain Francois
> Priority: Critical
> Fix For: 1.0.0
>
>
> Hello.
> I'm new to the arrow package in R and I'm having trouble with special characters (Icelandic). I have a large data set and everything is fine until I write the file to disk and read it in again (i.e. I use write_parquet() and then read_parquet()). When I read the data back into R, special characters turn into question marks, e.g. Veitingastaðir becomes Veitingasta�ir.
> This does not happen when I use .csv.
> Is there anything I can do when I write the .parquet file to disk or when I read it in to prevent this?
--
This message was sent by Atlassian Jira
(v8.3.4#803005)