Posted to jira@arrow.apache.org by "Joost Hoozemans (Jira)" <ji...@apache.org> on 2022/07/15 15:04:00 UTC

[jira] [Commented] (ARROW-16000) [C++][Dataset] Support Latin-1 encoding

    [ https://issues.apache.org/jira/browse/ARROW-16000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17567275#comment-17567275 ] 

Joost Hoozemans commented on ARROW-16000:
-----------------------------------------

Hi,

I've run into a similar issue when using pyarrow.dataset on non-UTF-8 data. Reading a single file with read_csv works, because the data is transcoded on the fly using Python codecs. With multi-file datasets, though, that approach would force all the data through a single Python interpreter, and it still would not help the other language bindings. Could we add transcoding on the C++ side?
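
To illustrate what I mean by transcoding on the fly, here is a minimal standard-library sketch of the idea. It is not pyarrow's actual implementation; the function name transcode_to_utf8, the chunk size, and the sample data are made up for the example.

```python
import codecs
import io


def transcode_to_utf8(raw, source_encoding, chunk_size=64 * 1024):
    """Yield UTF-8 encoded chunks from a binary stream in another encoding.

    Hypothetical helper for illustration only; not part of pyarrow.
    """
    # An incremental decoder copes with multi-byte characters that are
    # split across chunk boundaries (not an issue for Latin-1 itself).
    decoder = codecs.getincrementaldecoder(source_encoding)()
    while True:
        chunk = raw.read(chunk_size)
        if not chunk:
            break
        yield decoder.decode(chunk).encode("utf-8")
    # Flush buffered state; raises if the input ended mid-character.
    tail = decoder.decode(b"", final=True)
    if tail:
        yield tail.encode("utf-8")


# A tiny Latin-1 CSV held in memory (made-up sample data).
latin1_csv = "name,city\nJosé,Málaga\n".encode("latin-1")

# A deliberately small chunk size exercises the chunked path.
utf8_csv = b"".join(
    transcode_to_utf8(io.BytesIO(latin1_csv), "latin-1", chunk_size=4)
)
```

Every byte passes through the Python codec machinery here, which is why the same approach does not scale to a multi-file dataset scanned in parallel from C++.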

> [C++][Dataset] Support Latin-1 encoding
> ---------------------------------------
>
>                 Key: ARROW-16000
>                 URL: https://issues.apache.org/jira/browse/ARROW-16000
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++
>            Reporter: Nicola Crane
>            Priority: Major
>
> In ARROW-15992 a user reports problems reading files with Latin-1 encoding. I had a look through the docs for the Dataset API and I don't think this is currently supported.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)