Posted to jira@arrow.apache.org by "Jorge Leitão (Jira)" <ji...@apache.org> on 2021/09/19 18:48:00 UTC

[jira] [Comment Edited] (ARROW-14037) [C++] parquet with invalid utf8 does not error

    [ https://issues.apache.org/jira/browse/ARROW-14037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17417379#comment-17417379 ] 

Jorge Leitão edited comment on ARROW-14037 at 9/19/21, 6:47 PM:
----------------------------------------------------------------

Thanks; I have updated the example.

The context of this JIRA is that, in Rust, performing this operation must be marked as "unsafe" because it can result in UB (see https://github.com/apache/arrow-rs/issues/786). Rust and C++ are similar in this respect (both can trigger UB on invalid utf8), which is why I raised it here as well.

If I understand correctly, the design is that the user of a `Utf8Array` is responsible for validating the array before using it, so that no UB is triggered? If that is the case, then I think we should mark this as won't fix / works as intended.
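
To make "validate the array prior to using it" concrete, here is a minimal sketch in plain Python of what such a check amounts to for a string array laid out as an offsets buffer plus a values buffer. The helper name `first_invalid_utf8_slot` is hypothetical and is not part of pyarrow or arrow-rs; it only illustrates the idea, using the five-byte value (104, 101, 255, 108, 111) that appears in the byte dump in the description.

```python
from typing import Optional, Sequence


def first_invalid_utf8_slot(offsets: Sequence[int], values: bytes) -> Optional[int]:
    """Return the index of the first slot that is not valid UTF-8, or None.

    Hypothetical helper: `offsets` has one more entry than there are slots,
    and slot i covers values[offsets[i]:offsets[i + 1]].
    """
    for i in range(len(offsets) - 1):
        start, end = offsets[i], offsets[i + 1]
        try:
            # Strict decoding raises on any byte sequence that is not UTF-8.
            values[start:end].decode("utf-8")
        except UnicodeDecodeError:
            return i
    return None


# Slot 0 is valid; slot 1 holds the bytes from the example file (0xFF is
# never a legal byte in UTF-8).
good = "hello".encode("utf-8")
bad = bytes([104, 101, 255, 108, 111])

offsets = [0, len(good), len(good) + len(bad)]
values = good + bad

print(first_invalid_utf8_slot(offsets, values))
```

In pyarrow itself, `Table.validate(full=True)` is the documented way to request the expensive checks on a table; whether it flags the file from the description is not something established in this thread.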


was (Author: jorgecarleitao):
Thanks; I have updated the example.

The context of this JIRA is that, in Rust, performing this operation must be marked as "unsafe" because it can result in UB. https://github.com/apache/arrow-rs/issues/786. Rust and C++ are similar in this context, which is why I raised this here also.

If I am understanding correctly, the design is that it is the user of `Utf8Array`'s responsibility to always validate the array prior to using it, to ensure no UB is triggered? If that is the case, then I think we should mark this as won't fix / work as intended.

> [C++] parquet with invalid utf8 does not error
> ----------------------------------------------
>
>                 Key: ARROW-14037
>                 URL: https://issues.apache.org/jira/browse/ARROW-14037
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++, Python
>            Reporter: Jorge Leitão
>            Priority: Major
>
> The code below likely results in undefined behavior, as the data in the parquet file is not valid utf8.
> {code:python}
> from io import BytesIO
> import pyarrow
> import pyarrow.parquet
> # parquet with 1 column marked as string with invalid utf8
> data = [
>         80, 65, 82, 49, 21, 6, 21, 22, 21, 22, 92, 21, 2, 21, 0, 21, 2, 21, 0, 21, 4, 21,
>         0, 18, 28, 54, 0, 40, 5, 104, 101, 255, 108, 111, 24, 5, 104, 101, 255, 108, 111,
>         0, 0, 0, 3, 1, 5, 0, 0, 0, 104, 101, 255, 108, 111, 38, 110, 28, 21, 12, 25, 37,
>         6, 0, 25, 24, 2, 99, 49, 21, 0, 22, 2, 22, 102, 22, 102, 38, 8, 60, 54, 0, 40, 5,
>         104, 101, 255, 108, 111, 24, 5, 104, 101, 255, 108, 111, 0, 0, 0, 21, 4, 25, 44,
>         72, 4, 114, 111, 111, 116, 21, 2, 0, 21, 12, 37, 2, 24, 2, 99, 49, 37, 0, 76, 28,
>         0, 0, 0, 22, 2, 25, 28, 25, 28, 38, 110, 28, 21, 12, 25, 37, 6, 0, 25, 24, 2, 99,
>         49, 21, 0, 22, 2, 22, 102, 22, 102, 38, 8, 60, 54, 0, 40, 5, 104, 101, 255, 108,
>         111, 24, 5, 104, 101, 255, 108, 111, 0, 0, 0, 22, 102, 22, 2, 0, 40, 44, 65, 114,
>         114, 111, 119, 50, 32, 45, 32, 78, 97, 116, 105, 118, 101, 32, 82, 117, 115, 116,
>         32, 105, 109, 112, 108, 101, 109, 101, 110, 116, 97, 116, 105, 111, 110, 32, 111,
>         102, 32, 65, 114, 114, 111, 119, 0, 130, 0, 0, 0, 80, 65, 82, 49,
> ]
> data = BytesIO(bytearray(data))
> a = pyarrow.parquet.read_table(data)
> print(a.column(0))
> {code}
> but maybe this is by design and not a concern?


