Posted to jira@arrow.apache.org by "Joris Van den Bossche (Jira)" <ji...@apache.org> on 2021/11/09 15:30:00 UTC

[jira] [Commented] (ARROW-14564) [python] uint32 incorrectly saves to Parquet as int64

    [ https://issues.apache.org/jira/browse/ARROW-14564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17441223#comment-17441223 ] 

Joris Van den Bossche commented on ARROW-14564:
-----------------------------------------------

This is a limitation of the Parquet format itself. pyarrow currently defaults to writing Parquet format version 1.0 files, which do not yet support the uint32 type, so uint32 columns are stored as int64. However, you can specify {{version="2.4"}} in the {{write_table}} call, and it should then preserve both unsigned integer types.

We plan to switch to the newer Parquet version by default in the near future (see ARROW-12203 and the linked issues).

> [python] uint32 incorrectly saves to Parquet as int64
> -----------------------------------------------------
>
>                 Key: ARROW-14564
>                 URL: https://issues.apache.org/jira/browse/ARROW-14564
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 6.0.0
>         Environment: Ubuntu 20.10, Python 3.8.10
>            Reporter: Bruce Allen
>            Priority: Major
>         Attachments: test_u32.py
>
>
> The function pyarrow.parquet.write_table() incorrectly saves data of type unsigned int32 as signed int64. The attached script test_u32.py demonstrates the failure.
> Output from running test_u32.py indicating faulty retyping:
> pyarrow version: 6.0.0
> numpy data:
> [(1, 2) (3, 4)]
> [('my_u2', '<u2'), ('my_u4', '<u4')]
> result:
>    my_u2  my_u4
> 0      1      2
> 1      3      4
> my_u2    uint16
> my_u4     int64
> dtype: object
>  
> We can also observe that the incorrect int64 type is in the Parquet file by using the "parq" tool:
> $ parq _test_u32_pq --schema
> # Schema 
>  <pyarrow._parquet.ParquetSchema object at 0x7ff2e40b2a40>
> required group field_id=-1 schema {
>  optional int32 field_id=-1 my_u2 (Int(bitWidth=16, isSigned=false));
>  optional int64 field_id=-1 my_u4;
> }


