Posted to jira@arrow.apache.org by "Joris Van den Bossche (Jira)" <ji...@apache.org> on 2021/11/09 15:30:00 UTC
[jira] [Commented] (ARROW-14564) [python] uint32 incorrectly saves to Parquet as int64
[ https://issues.apache.org/jira/browse/ARROW-14564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17441223#comment-17441223 ]
Joris Van den Bossche commented on ARROW-14564:
-----------------------------------------------
This is a limitation of the Parquet format itself. PyArrow defaults to writing Parquet format version 1.0 files, which do not yet support the uint32 type, so the column is stored as int64 instead. However, you can pass {{version="2.4"}} in the {{write_table}} call, and then it should preserve both unsigned integer types (uint16 and uint32).
We plan to switch to the newer Parquet format version by default in the near future (see ARROW-12203 and the linked issues).
> [python] uint32 incorrectly saves to Parquet as int64
> -----------------------------------------------------
>
> Key: ARROW-14564
> URL: https://issues.apache.org/jira/browse/ARROW-14564
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Affects Versions: 6.0.0
> Environment: Ubuntu 20.10, Python 3.8.10
> Reporter: Bruce Allen
> Priority: Major
> Attachments: test_u32.py
>
>
> Function pyarrow.parquet.write_table() incorrectly saves data of type unsigned int32 as signed int64. Code test_u32.py showing failure is attached.
> Output from running test_u32.py indicating faulty retyping:
> pyarrow version: 6.0.0
> numpy data:
> [(1, 2) (3, 4)]
> [('my_u2', '<u2'), ('my_u4', '<u4')]
> result:
>    my_u2  my_u4
> 0      1      2
> 1      3      4
> my_u2    uint16
> my_u4     int64
> dtype: object
>
> We can also observe that the incorrect int64 type is in the Parquet file by using the "parq" tool:
> $ parq _test_u32_pq --schema
> # Schema
> <pyarrow._parquet.ParquetSchema object at 0x7ff2e40b2a40>
> required group field_id=-1 schema {
>   optional int32 field_id=-1 my_u2 (Int(bitWidth=16, isSigned=false));
>   optional int64 field_id=-1 my_u4;
> }
--
This message was sent by Atlassian Jira
(v8.20.1#820001)