Posted to jira@arrow.apache.org by "Jorge Leitão (Jira)" <ji...@apache.org> on 2021/09/19 14:46:00 UTC

[jira] [Created] (ARROW-14037) [C++] parquet with invalid utf8 does not error

Jorge Leitão created ARROW-14037:
------------------------------------

             Summary: [C++] parquet with invalid utf8 does not error
                 Key: ARROW-14037
                 URL: https://issues.apache.org/jira/browse/ARROW-14037
             Project: Apache Arrow
          Issue Type: Improvement
          Components: C++, Python
            Reporter: Jorge Leitão


The code below likely results in undefined behavior, because the string data in the Parquet file is not valid UTF-8.

{code:python}
from io import BytesIO
import pyarrow
import pyarrow.parquet

# Parquet file with 1 column marked as string that contains invalid UTF-8
data = [
        80, 65, 82, 49, 21, 6, 21, 22, 21, 22, 92, 21, 2, 21, 0, 21, 2, 21, 0, 21, 4, 21,
        0, 18, 28, 54, 0, 40, 5, 104, 101, 255, 108, 111, 24, 5, 104, 101, 255, 108, 111,
        0, 0, 0, 3, 1, 5, 0, 0, 0, 104, 101, 255, 108, 111, 38, 110, 28, 21, 12, 25, 37,
        6, 0, 25, 24, 2, 99, 49, 21, 0, 22, 2, 22, 102, 22, 102, 38, 8, 60, 54, 0, 40, 5,
        104, 101, 255, 108, 111, 24, 5, 104, 101, 255, 108, 111, 0, 0, 0, 21, 4, 25, 44,
        72, 4, 114, 111, 111, 116, 21, 2, 0, 21, 12, 37, 2, 24, 2, 99, 49, 37, 0, 76, 28,
        0, 0, 0, 22, 2, 25, 28, 25, 28, 38, 110, 28, 21, 12, 25, 37, 6, 0, 25, 24, 2, 99,
        49, 21, 0, 22, 2, 22, 102, 22, 102, 38, 8, 60, 54, 0, 40, 5, 104, 101, 255, 108,
        111, 24, 5, 104, 101, 255, 108, 111, 0, 0, 0, 22, 102, 22, 2, 0, 40, 44, 65, 114,
        114, 111, 119, 50, 32, 45, 32, 78, 97, 116, 105, 118, 101, 32, 82, 117, 115, 116,
        32, 105, 109, 112, 108, 101, 109, 101, 110, 116, 97, 116, 105, 111, 110, 32, 111,
        102, 32, 65, 114, 114, 111, 119, 0, 130, 0, 0, 0, 80, 65, 82, 49,
]
# Wrap the raw byte values in an in-memory file object
data = BytesIO(bytes(data))

a = pyarrow.parquet.read_table(data)

print(a.column(0))
{code}

But maybe this is by design and not a concern?
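
For reference, a minimal sketch of why the payload is invalid. The single value stored in the column is the byte sequence 104, 101, 255, 108, 111 (taken from the data above), and the byte 0xFF never occurs in well-formed UTF-8. The helper name is_valid_utf8 is hypothetical and uses only the Python standard library; it is not meant to show what pyarrow does internally.

```python
# The single string value stored in the column, copied from the byte list above.
raw = bytes([104, 101, 255, 108, 111])  # b"he\xfflo"


def is_valid_utf8(buf: bytes) -> bool:
    """Return True if buf decodes as strict UTF-8 (hypothetical helper)."""
    try:
        buf.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


print(is_valid_utf8(raw))       # False: 0xFF is never valid in UTF-8
print(is_valid_utf8(b"hello"))  # True
```

A strict decode like this is the kind of check a reader would need to run on string columns in order to reject such a file instead of returning it as a valid string array.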


