Posted to jira@arrow.apache.org by "Jorge Leitão (Jira)" <ji...@apache.org> on 2021/06/29 19:08:00 UTC

[jira] [Created] (ARROW-13214) [C++] [Parquet] uint32 does not roundtrip?

Jorge Leitão created ARROW-13214:
------------------------------------

             Summary: [C++] [Parquet] uint32 does not roundtrip?
                 Key: ARROW-13214
                 URL: https://issues.apache.org/jira/browse/ARROW-13214
             Project: Apache Arrow
          Issue Type: Bug
          Components: Parquet
            Reporter: Jorge Leitão


I found that the following uint32 columns do not roundtrip through Parquet (each entry is the file name and the type of a failing column):

{code}
[('generated_primitive', DataType(uint32)), ('generated_primitive', DataType(uint32))]
[('generated_primitive_no_batches', DataType(uint32)), ('generated_primitive_no_batches', DataType(uint32))]
[('generated_primitive_zerolength', DataType(uint32)), ('generated_primitive_zerolength', DataType(uint32))]
{code}

The exact code I am using to reproduce this:

{code:python}
import os

import pyarrow.ipc
import pyarrow.parquet as pq


def get_file_path(file: str):
    return f"../testing/arrow-testing/data/arrow-ipc-stream/integration/1.0.0-littleendian/{file}.arrow_file"


def _expected(file: str):
    return pyarrow.ipc.RecordBatchFileReader(get_file_path(file)).read_all()


def check_file(file):
    expected = _expected(file)
    path = f"{file}.parquet"

    pq.write_table(expected, path, compression=None, write_statistics=False)

    table = pq.read_table(path)
    os.remove(path)

    failing = []
    for c1, c2 in zip(expected, table):
        if c1 != c2:
            failing.append((file, c1.type))
    return failing


for file in [
    "generated_primitive",
    "generated_primitive_no_batches",
    "generated_primitive_zerolength",
    "generated_null",
    "generated_null_trivial",
    "generated_primitive_large_offsets",
]:
    failing = check_file(file)
    if failing:
        print(failing)
{code}

Note: I generated the same Parquet file using the experimental parquet2 and the roundtrip succeeds, which suggests that the error, if there is one, is on the write path.

Upon further investigation, it seems that the only difference is the type: c1 (the original column) has type uint32, while c2 (the column read back from Parquet) has type int64.
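
The script above records only the original type of a failing column, so a change of type is easy to miss. A minimal sketch of a helper that reports both types side by side is given below. The name type_mismatches and the column names "a" and "b" are hypothetical and are not taken from the integration files; the stand-in data only mirrors the uint32/int64 observation.

```python
from typing import Iterable, List, Tuple


def type_mismatches(
    expected: Iterable[Tuple[str, object]],
    actual: Iterable[Tuple[str, object]],
) -> List[Tuple[str, str, str]]:
    """Return (name, expected_type, actual_type) for each column whose type changed."""
    mismatches = []
    for (name, expected_type), (_, actual_type) in zip(expected, actual):
        # Compare string forms so the helper accepts plain strings as well as
        # type objects such as pyarrow.DataType.
        if str(expected_type) != str(actual_type):
            mismatches.append((name, str(expected_type), str(actual_type)))
    return mismatches


# Hypothetical stand-in schemas: column "b" comes back with a different type.
before = [("a", "int32"), ("b", "uint32")]
after = [("a", "int32"), ("b", "int64")]
print(type_mismatches(before, after))
```

With pyarrow, the two arguments could be built as list(zip(expected.schema.names, expected.schema.types)) and the same expression for the table read back from Parquet.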