You are viewing a plain text version of this content. The canonical link for it is here.
Posted to jira@arrow.apache.org by "Jorge Leitão (Jira)" <ji...@apache.org> on 2021/09/19 18:42:00 UTC

[jira] [Updated] (ARROW-14037) [C++] parquet with invalid utf8 does not error

     [ https://issues.apache.org/jira/browse/ARROW-14037?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jorge Leitão updated ARROW-14037:
---------------------------------
    Description: 
The code below likely results in undefined behavior as the data in the parquet is not valid utf8.

{code:java}
from io import BytesIO
import pyarrow
import pyarrow.parquet

// parquet with 1 column marked as string with invalid utf8
data = [
        80, 65, 82, 49, 21, 6, 21, 22, 21, 22, 92, 21, 2, 21, 0, 21, 2, 21, 0, 21, 4, 21,
        0, 18, 28, 54, 0, 40, 5, 104, 101, 255, 108, 111, 24, 5, 104, 101, 255, 108, 111,
        0, 0, 0, 3, 1, 5, 0, 0, 0, 104, 101, 255, 108, 111, 38, 110, 28, 21, 12, 25, 37,
        6, 0, 25, 24, 2, 99, 49, 21, 0, 22, 2, 22, 102, 22, 102, 38, 8, 60, 54, 0, 40, 5,
        104, 101, 255, 108, 111, 24, 5, 104, 101, 255, 108, 111, 0, 0, 0, 21, 4, 25, 44,
        72, 4, 114, 111, 111, 116, 21, 2, 0, 21, 12, 37, 2, 24, 2, 99, 49, 37, 0, 76, 28,
        0, 0, 0, 22, 2, 25, 28, 25, 28, 38, 110, 28, 21, 12, 25, 37, 6, 0, 25, 24, 2, 99,
        49, 21, 0, 22, 2, 22, 102, 22, 102, 38, 8, 60, 54, 0, 40, 5, 104, 101, 255, 108,
        111, 24, 5, 104, 101, 255, 108, 111, 0, 0, 0, 22, 102, 22, 2, 0, 40, 44, 65, 114,
        114, 111, 119, 50, 32, 45, 32, 78, 97, 116, 105, 118, 101, 32, 82, 117, 115, 116,
        32, 105, 109, 112, 108, 101, 109, 101, 110, 116, 97, 116, 105, 111, 110, 32, 111,
        102, 32, 65, 114, 114, 111, 119, 0, 130, 0, 0, 0, 80, 65, 82, 49,
]
data = BytesIO(bytearray(data))

a = pyarrow.parquet.read_table(data)

print(a.column(0))
{code}

but maybe this is by design and not a concern?

  was:
The code below likely results in undefined behavior as the data in the parquet is not valid utf8.

{code:java}
from io import BytesIO
import pyarrow
import pyarrow.parquet

// parquet with 1 column marked as string with invalid utf8
data = [
        80, 65, 82, 49, 21, 6, 21, 22, 21, 22, 92, 21, 2, 21, 0, 21, 2, 21, 0, 21, 4, 21,
        0, 18, 28, 54, 0, 40, 5, 104, 101, 255, 108, 111, 24, 5, 104, 101, 255, 108, 111,
        0, 0, 0, 3, 1, 5, 0, 0, 0, 104, 101, 255, 108, 111, 38, 110, 28, 21, 12, 25, 37,
        6, 0, 25, 24, 2, 99, 49, 21, 0, 22, 2, 22, 102, 22, 102, 38, 8, 60, 54, 0, 40, 5,
        104, 101, 255, 108, 111, 24, 5, 104, 101, 255, 108, 111, 0, 0, 0, 21, 4, 25, 44,
        72, 4, 114, 111, 111, 116, 21, 2, 0, 21, 12, 37, 2, 24, 2, 99, 49, 37, 0, 76, 28,
        0, 0, 0, 22, 2, 25, 28, 25, 28, 38, 110, 28, 21, 12, 25, 37, 6, 0, 25, 24, 2, 99,
        49, 21, 0, 22, 2, 22, 102, 22, 102, 38, 8, 60, 54, 0, 40, 5, 104, 101, 255, 108,
        111, 24, 5, 104, 101, 255, 108, 111, 0, 0, 0, 22, 102, 22, 2, 0, 40, 44, 65, 114,
        114, 111, 119, 50, 32, 45, 32, 78, 97, 116, 105, 118, 101, 32, 82, 117, 115, 116,
        32, 105, 109, 112, 108, 101, 109, 101, 110, 116, 97, 116, 105, 111, 110, 32, 111,
        102, 32, 65, 114, 114, 111, 119, 0, 130, 0, 0, 0, 80, 65, 82, 49,
]
data = map(lambda x: x.to_bytes(1, byteorder="little"), data)
data = BytesIO(b"".join(data))

a = pyarrow.parquet.read_table(data)

print(a.column(0))
{code}

but maybe this is by design and not a concern?


> [C++] parquet with invalid utf8 does not error
> ----------------------------------------------
>
>                 Key: ARROW-14037
>                 URL: https://issues.apache.org/jira/browse/ARROW-14037
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++, Python
>            Reporter: Jorge Leitão
>            Priority: Major
>
> The code below likely results in undefined behavior as the data in the parquet is not valid utf8.
> {code:java}
> from io import BytesIO
> import pyarrow
> import pyarrow.parquet
> // parquet with 1 column marked as string with invalid utf8
> data = [
>         80, 65, 82, 49, 21, 6, 21, 22, 21, 22, 92, 21, 2, 21, 0, 21, 2, 21, 0, 21, 4, 21,
>         0, 18, 28, 54, 0, 40, 5, 104, 101, 255, 108, 111, 24, 5, 104, 101, 255, 108, 111,
>         0, 0, 0, 3, 1, 5, 0, 0, 0, 104, 101, 255, 108, 111, 38, 110, 28, 21, 12, 25, 37,
>         6, 0, 25, 24, 2, 99, 49, 21, 0, 22, 2, 22, 102, 22, 102, 38, 8, 60, 54, 0, 40, 5,
>         104, 101, 255, 108, 111, 24, 5, 104, 101, 255, 108, 111, 0, 0, 0, 21, 4, 25, 44,
>         72, 4, 114, 111, 111, 116, 21, 2, 0, 21, 12, 37, 2, 24, 2, 99, 49, 37, 0, 76, 28,
>         0, 0, 0, 22, 2, 25, 28, 25, 28, 38, 110, 28, 21, 12, 25, 37, 6, 0, 25, 24, 2, 99,
>         49, 21, 0, 22, 2, 22, 102, 22, 102, 38, 8, 60, 54, 0, 40, 5, 104, 101, 255, 108,
>         111, 24, 5, 104, 101, 255, 108, 111, 0, 0, 0, 22, 102, 22, 2, 0, 40, 44, 65, 114,
>         114, 111, 119, 50, 32, 45, 32, 78, 97, 116, 105, 118, 101, 32, 82, 117, 115, 116,
>         32, 105, 109, 112, 108, 101, 109, 101, 110, 116, 97, 116, 105, 111, 110, 32, 111,
>         102, 32, 65, 114, 114, 111, 119, 0, 130, 0, 0, 0, 80, 65, 82, 49,
> ]
> data = BytesIO(bytearray(data))
> a = pyarrow.parquet.read_table(data)
> print(a.column(0))
> {code}
> but maybe this is by design and not a concern?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)