Posted to jira@arrow.apache.org by "Antoine Pitrou (Jira)" <ji...@apache.org> on 2021/09/19 16:05:00 UTC
[jira] [Commented] (ARROW-14037) [C++] parquet with invalid utf8 does not error
[ https://issues.apache.org/jira/browse/ARROW-14037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17417361#comment-17417361 ]
Antoine Pitrou commented on ARROW-14037:
----------------------------------------
Indeed, we don't check that string data is valid UTF-8 on loading.
Calling {{a.validate(full=True)}} should raise an error, though.
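To make the problem concrete: the string value stored in the file below appears to be the five bytes 104, 101, 255, 108, 111. The sketch below uses only Python's standard library (not pyarrow, and not the check Arrow itself performs) to show where a strict UTF-8 decode rejects that value. The helper name first_invalid_utf8_offset is hypothetical, chosen for illustration.

```python
# The string value as it appears in the reproduction data: b"he\xffllo".
value = bytes([104, 101, 255, 108, 111])


def first_invalid_utf8_offset(raw: bytes):
    """Return the offset of the first byte that breaks UTF-8 decoding,
    or None if the whole buffer decodes cleanly.

    Hypothetical helper for illustration; not part of pyarrow.
    """
    try:
        raw.decode("utf-8")  # decoding is strict by default
    except UnicodeDecodeError as exc:
        return exc.start
    return None


print(first_invalid_utf8_offset(value))     # 2 (the 0xFF byte)
print(first_invalid_utf8_offset(b"hello"))  # None
```

The byte 0xFF can never occur in well-formed UTF-8, so any full validation pass over the column would be expected to flag this value.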
> [C++] parquet with invalid utf8 does not error
> ----------------------------------------------
>
> Key: ARROW-14037
> URL: https://issues.apache.org/jira/browse/ARROW-14037
> Project: Apache Arrow
> Issue Type: Improvement
> Components: C++, Python
> Reporter: Jorge Leitão
> Priority: Major
>
> The code below likely results in undefined behavior, as the data in the Parquet file is not valid UTF-8.
> {code:python}
> from io import BytesIO
> import pyarrow
> import pyarrow.parquet
> # parquet with 1 column marked as string with invalid utf8
> data = [
> 80, 65, 82, 49, 21, 6, 21, 22, 21, 22, 92, 21, 2, 21, 0, 21, 2, 21, 0, 21, 4, 21,
> 0, 18, 28, 54, 0, 40, 5, 104, 101, 255, 108, 111, 24, 5, 104, 101, 255, 108, 111,
> 0, 0, 0, 3, 1, 5, 0, 0, 0, 104, 101, 255, 108, 111, 38, 110, 28, 21, 12, 25, 37,
> 6, 0, 25, 24, 2, 99, 49, 21, 0, 22, 2, 22, 102, 22, 102, 38, 8, 60, 54, 0, 40, 5,
> 104, 101, 255, 108, 111, 24, 5, 104, 101, 255, 108, 111, 0, 0, 0, 21, 4, 25, 44,
> 72, 4, 114, 111, 111, 116, 21, 2, 0, 21, 12, 37, 2, 24, 2, 99, 49, 37, 0, 76, 28,
> 0, 0, 0, 22, 2, 25, 28, 25, 28, 38, 110, 28, 21, 12, 25, 37, 6, 0, 25, 24, 2, 99,
> 49, 21, 0, 22, 2, 22, 102, 22, 102, 38, 8, 60, 54, 0, 40, 5, 104, 101, 255, 108,
> 111, 24, 5, 104, 101, 255, 108, 111, 0, 0, 0, 22, 102, 22, 2, 0, 40, 44, 65, 114,
> 114, 111, 119, 50, 32, 45, 32, 78, 97, 116, 105, 118, 101, 32, 82, 117, 115, 116,
> 32, 105, 109, 112, 108, 101, 109, 101, 110, 116, 97, 116, 105, 111, 110, 32, 111,
> 102, 32, 65, 114, 114, 111, 119, 0, 130, 0, 0, 0, 80, 65, 82, 49,
> ]
> data = BytesIO(bytes(data))
> a = pyarrow.parquet.read_table(data)
> print(a.column(0))
> {code}
> But maybe this is by design and not a concern?