Posted to jira@arrow.apache.org by "Antoine Pitrou (Jira)" <ji...@apache.org> on 2022/07/15 19:18:00 UTC

[jira] [Comment Edited] (ARROW-16000) [C++][Dataset] Support Latin-1 encoding

    [ https://issues.apache.org/jira/browse/ARROW-16000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17567364#comment-17567364 ] 

Antoine Pitrou edited comment on ARROW-16000 at 7/15/22 7:17 PM:
-----------------------------------------------------------------

I would want to know first if there's actual contention due to the Python GIL and/or interpreter overhead.

In C++, the basic building block is {{TransformInputStream}}: https://arrow.apache.org/docs/cpp/api/io.html#transforming-input-wrapper and https://github.com/apache/arrow/blob/master/cpp/src/arrow/io/transform.h (sadly, it seems this lacks docstrings; my bad). It should be easy for a reasonably skilled C++ developer to use it to wrap their transcoding library of choice (some might want to use ICU, others libiconv, etc.).
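
For Latin-1 specifically, the transcoding step is small enough to write without any library. The following is a minimal sketch of such a step, not code from Arrow: the function name {{Latin1ToUtf8}} is hypothetical, and it works on {{std::string}} chunks rather than Arrow buffers. Adapting it to the callback that {{TransformInputStream}} expects is left out here, since the exact signature should be checked against transform.h.

```cpp
#include <string>

// Hypothetical helper (not part of Arrow): transcode one chunk of
// ISO-8859-1 (Latin-1) bytes to UTF-8.
//
// Every Latin-1 byte maps to exactly one code point in U+0000..U+00FF,
// so the conversion keeps no state between calls and can be applied to
// arbitrary chunks of a stream independently.
std::string Latin1ToUtf8(const std::string& input) {
  std::string output;
  output.reserve(input.size() * 2);  // worst case: every byte becomes two
  for (unsigned char c : input) {
    if (c < 0x80) {
      // ASCII range: identical in Latin-1 and UTF-8.
      output.push_back(static_cast<char>(c));
    } else {
      // U+0080..U+00FF: two-byte UTF-8 sequence 110xxxxx 10xxxxxx.
      output.push_back(static_cast<char>(0xC0 | (c >> 6)));
      output.push_back(static_cast<char>(0x80 | (c & 0x3F)));
    }
  }
  return output;
}
```

Because the mapping is byte-by-byte, chunk boundaries never split a character on the input side. A converter for a multi-byte source encoding would instead have to carry partial sequences over from one chunk to the next.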

I think it would be ideal if we offered an optional header-only wrapper that would wrap ICU in a {{TransformInputStream}}, without actually requiring ICU to be present when compiling Arrow. Perhaps this can be done through templates?
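
One way the template idea could look is sketched below. This is an illustration under stated assumptions, not an existing Arrow API: {{ChunkTransform}}, {{MakeTranscodingTransform}} and {{Latin1Converter}} are hypothetical names, chunks are modelled as {{std::string}} instead of Arrow buffers, and the stand-in converter is a trivial Latin-1 one rather than ICU. The point is only that the adapter depends on the converter type solely at instantiation time, so the header can ship without the transcoding library being present.

```cpp
#include <functional>
#include <memory>
#include <string>
#include <utility>

// Hypothetical chunk-transform type used only in this sketch; the real
// TransformInputStream callback works on Arrow buffers instead.
using ChunkTransform = std::function<std::string(const std::string&)>;

// Header-only adapter. Converter is any type that provides
//   std::string Convert(const std::string& chunk);
// The transcoding library is needed only where this template is
// instantiated, not where the header itself is compiled.
template <typename Converter>
ChunkTransform MakeTranscodingTransform(Converter converter) {
  // Keep the converter in shared state so that a stateful converter can
  // remember partial sequences across chunks, and so that the resulting
  // callable stays copyable as std::function requires.
  auto state = std::make_shared<Converter>(std::move(converter));
  return [state](const std::string& chunk) { return state->Convert(chunk); };
}

// Stand-in converter for the sketch (NOT ICU): Latin-1 to UTF-8.
struct Latin1Converter {
  std::string Convert(const std::string& chunk) const {
    std::string out;
    out.reserve(chunk.size() * 2);
    for (unsigned char c : chunk) {
      if (c < 0x80) {
        out.push_back(static_cast<char>(c));
      } else {
        out.push_back(static_cast<char>(0xC0 | (c >> 6)));
        out.push_back(static_cast<char>(0x80 | (c & 0x3F)));
      }
    }
    return out;
  }
};
```

A user who has ICU or libiconv available would write their own converter type with the same {{Convert}} member and pass it to the adapter; code that never instantiates the template never needs those headers.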

Also, datasets may need to grow a dedicated configuration option to wrap all input streams.




> [C++][Dataset] Support Latin-1 encoding
> ---------------------------------------
>
>                 Key: ARROW-16000
>                 URL: https://issues.apache.org/jira/browse/ARROW-16000
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++
>            Reporter: Nicola Crane
>            Priority: Major
>
> In ARROW-15992 a user is reporting issues with trying to read in files with Latin-1 encoding.  I had a look through the docs for the Dataset API and I don't think this is currently supported.


