Posted to jira@arrow.apache.org by "Antoine Pitrou (Jira)" <ji...@apache.org> on 2021/06/29 19:41:00 UTC

[jira] [Closed] (ARROW-13214) [C++] [Parquet] uint32 does not roundtrip?

     [ https://issues.apache.org/jira/browse/ARROW-13214?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Antoine Pitrou closed ARROW-13214.
----------------------------------
    Resolution: Duplicate

> [C++] [Parquet] uint32 does not roundtrip?
> ------------------------------------------
>
>                 Key: ARROW-13214
>                 URL: https://issues.apache.org/jira/browse/ARROW-13214
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Parquet
>            Reporter: Jorge Leitão
>            Priority: Major
>
> I found that the following does not roundtrip:
> {code:java}
> [('generated_primitive', DataType(uint32)), ('generated_primitive', DataType(uint32))]
> [('generated_primitive_no_batches', DataType(uint32)), ('generated_primitive_no_batches', DataType(uint32))]
> [('generated_primitive_zerolength', DataType(uint32)), ('generated_primitive_zerolength', DataType(uint32))]
> {code}
> The exact code I am using for this is:
> {code:python}
> import os
> import pyarrow.ipc
> import pyarrow.parquet as pq
> def get_file_path(file: str):
>     return f"../testing/arrow-testing/data/arrow-ipc-stream/integration/1.0.0-littleendian/{file}.arrow_file"
> def _expected(file: str):
>     return pyarrow.ipc.RecordBatchFileReader(get_file_path(file)).read_all()
> def check_file(file):
>     expected = _expected(file)
>     path = f"{file}.parquet"
>     pq.write_table(expected, path, compression=None, write_statistics=False)
>     table = pq.read_table(path)
>     os.remove(path)
>     failing = []
>     for c1, c2 in zip(expected, table):
>         if c1 != c2:
>             failing.append((file, c1.type))
>     return failing
> for file in [
>     "generated_primitive",
>     "generated_primitive_no_batches",
>     "generated_primitive_zerolength",
>     "generated_null",
>     "generated_null_trivial",
>     "generated_primitive_large_offsets",
> ]:
>     failing = check_file(file)
>     if failing:
>         print(failing)
> {code}
> Note: I generated the same Parquet file with the experimental parquet2 and the roundtrip succeeds, which suggests that the error is on the write side.
> Upon further investigation, it seems that the only difference is the type: c1's type is uint32, c2's type is int64.
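
One possible reading of the uint32 to int64 change described above, offered as an assumption rather than something confirmed in this message: a writer that can only use signed integer storage has to widen uint32 to a 64-bit signed type, because the upper half of the uint32 range does not fit in 32 signed bits. The sketch below illustrates only that range argument with the standard library; the helper name `fits` is hypothetical and nothing here calls Arrow or Parquet.

```python
import struct

UINT32_MAX = 2**32 - 1  # largest value a uint32 column can hold


def fits(fmt: str, value: int) -> bool:
    """Return True if `value` can be packed with the given struct format."""
    try:
        struct.pack(fmt, value)
        return True
    except struct.error:
        return False


# "<i" is a little-endian signed 32-bit integer, "<q" a signed 64-bit one.
fits_int32 = fits("<i", UINT32_MAX)
fits_int64 = fits("<q", UINT32_MAX)
print(f"uint32 max fits in int32: {fits_int32}")
print(f"uint32 max fits in int64: {fits_int64}")
```

If this reading applies, the values themselves would survive the roundtrip and only the column type would differ, which matches the observation above. The pyarrow documentation for the `version` argument of `write_table` states that UINT32 is only available with newer Parquet format versions, so writing with a newer format version may be worth trying.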


