Posted to jira@arrow.apache.org by "Jorge Leitão (Jira)" <ji...@apache.org> on 2021/09/19 14:46:00 UTC
[jira] [Created] (ARROW-14037) [C++] parquet with invalid utf8 does not error
Jorge Leitão created ARROW-14037:
------------------------------------
Summary: [C++] parquet with invalid utf8 does not error
Key: ARROW-14037
URL: https://issues.apache.org/jira/browse/ARROW-14037
Project: Apache Arrow
Issue Type: Improvement
Components: C++, Python
Reporter: Jorge Leitão
The code below likely results in undefined behavior, because the data in the Parquet file is not valid UTF-8.
{code:python}
from io import BytesIO
import pyarrow
import pyarrow.parquet
# Parquet file with 1 column marked as string, containing invalid UTF-8
data = [
80, 65, 82, 49, 21, 6, 21, 22, 21, 22, 92, 21, 2, 21, 0, 21, 2, 21, 0, 21, 4, 21,
0, 18, 28, 54, 0, 40, 5, 104, 101, 255, 108, 111, 24, 5, 104, 101, 255, 108, 111,
0, 0, 0, 3, 1, 5, 0, 0, 0, 104, 101, 255, 108, 111, 38, 110, 28, 21, 12, 25, 37,
6, 0, 25, 24, 2, 99, 49, 21, 0, 22, 2, 22, 102, 22, 102, 38, 8, 60, 54, 0, 40, 5,
104, 101, 255, 108, 111, 24, 5, 104, 101, 255, 108, 111, 0, 0, 0, 21, 4, 25, 44,
72, 4, 114, 111, 111, 116, 21, 2, 0, 21, 12, 37, 2, 24, 2, 99, 49, 37, 0, 76, 28,
0, 0, 0, 22, 2, 25, 28, 25, 28, 38, 110, 28, 21, 12, 25, 37, 6, 0, 25, 24, 2, 99,
49, 21, 0, 22, 2, 22, 102, 22, 102, 38, 8, 60, 54, 0, 40, 5, 104, 101, 255, 108,
111, 24, 5, 104, 101, 255, 108, 111, 0, 0, 0, 22, 102, 22, 2, 0, 40, 44, 65, 114,
114, 111, 119, 50, 32, 45, 32, 78, 97, 116, 105, 118, 101, 32, 82, 117, 115, 116,
32, 105, 109, 112, 108, 101, 109, 101, 110, 116, 97, 116, 105, 111, 110, 32, 111,
102, 32, 65, 114, 114, 111, 119, 0, 130, 0, 0, 0, 80, 65, 82, 49,
]
# Pack the integer list into raw bytes and wrap it in a file-like object
data = BytesIO(bytes(data))
a = pyarrow.parquet.read_table(data)
print(a.column(0))
{code}
But maybe this is by design and not a concern?