Posted to jira@arrow.apache.org by "Antoine Pitrou (Jira)" <ji...@apache.org> on 2021/06/29 19:41:00 UTC
[jira] [Closed] (ARROW-13214) [C++] [Parquet] uint32 does not roundtrip?
[ https://issues.apache.org/jira/browse/ARROW-13214?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Antoine Pitrou closed ARROW-13214.
----------------------------------
Resolution: Duplicate
> [C++] [Parquet] uint32 does not roundtrip?
> ------------------------------------------
>
> Key: ARROW-13214
> URL: https://issues.apache.org/jira/browse/ARROW-13214
> Project: Apache Arrow
> Issue Type: Bug
> Components: Parquet
> Reporter: Jorge Leitão
> Priority: Major
>
> I found that the following does not roundtrip:
> {code:java}
> [('generated_primitive', DataType(uint32)), ('generated_primitive', DataType(uint32))]
> [('generated_primitive_no_batches', DataType(uint32)), ('generated_primitive_no_batches', DataType(uint32))]
> [('generated_primitive_zerolength', DataType(uint32)), ('generated_primitive_zerolength', DataType(uint32))]
> {code}
> The exact code I am using for this:
> {code:java}
> import os
> import pyarrow.ipc
> import pyarrow.parquet as pq
>
> def get_file_path(file: str):
>     return f"../testing/arrow-testing/data/arrow-ipc-stream/integration/1.0.0-littleendian/{file}.arrow_file"
>
> def _expected(file: str):
>     return pyarrow.ipc.RecordBatchFileReader(get_file_path(file)).read_all()
>
> def check_file(file):
>     expected = _expected(file)
>     path = f"{file}.parquet"
>     pq.write_table(expected, path, compression=None, write_statistics=False)
>     table = pq.read_table(path)
>     os.remove(path)
>     failing = []
>     for c1, c2 in zip(expected, table):
>         if c1 != c2:
>             failing.append((file, c1.type))
>     return failing
>
> for file in [
>     "generated_primitive",
>     "generated_primitive_no_batches",
>     "generated_primitive_zerolength",
>     "generated_null",
>     "generated_null_trivial",
>     "generated_primitive_large_offsets",
> ]:
>     failing = check_file(file)
>     if failing:
>         print(failing)
> {code}
> Note: I generated the same Parquet file using the experimental parquet2 and the roundtrip succeeds, which suggests that the error, if any, is on the write side.
> Upon further investigation, it seems that the only difference is the type: c1's type is uint32, c2's type is int64.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)