Posted to jira@arrow.apache.org by "Joris Van den Bossche (Jira)" <ji...@apache.org> on 2021/02/03 11:14:00 UTC
[jira] [Comment Edited] (ARROW-9215) pyarrow parquet writer converts uint32 columns to int64

    [ https://issues.apache.org/jira/browse/ARROW-9215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17277915#comment-17277915 ] 

Joris Van den Bossche edited comment on ARROW-9215 at 2/3/21, 11:13 AM:
------------------------------------------------------------------------

Yeah, I understand that for uint32 there is a "safe" alternative, while for uint64 no such alternative exists (so it's certainly a trade-off). But I assume the reason we use this safe alternative (with {{version="1.0"}}) is to prevent old readers from wrongly interpreting the uint32 data as its physical int32 type (which would silently give wrong numbers for values outside the int32 range but within the uint32 range). But the same thing can currently happen for uint64 columns, if I understand correctly?

So if we deem it OK to write uint64 in a way that old readers potentially cannot read correctly, why not be consistent for uint32?
(And I know "because it is possible to be safe" is a good reason not to be consistent; to be clear, I am just trying to understand the trade-offs and choices made here.)
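
To make the failure mode concrete, here is a minimal sketch using only Python's standard {{struct}} module. It is not pyarrow's or Parquet's actual code path: the helper {{reinterpret}} and the sample values are hypothetical, and it only assumes that an "old reader" ignores the unsigned annotation and decodes the stored bytes as the signed physical type.

```python
import struct


def reinterpret(value, stored_fmt, read_fmt):
    """Pack `value` with one little-endian format and decode the same bytes with another.

    Hypothetical helper for illustration only; both formats must have the same width.
    """
    return struct.unpack(read_fmt, struct.pack(stored_fmt, value))[0]


# A uint32 value above the int32 maximum (2**31 - 1).
u32 = 3_000_000_000

# Stored as 4 unsigned bytes, decoded as signed int32: silently wrong.
wrong_u32 = reinterpret(u32, "<I", "<i")

# The "safe" alternative: widen to int64 before storing, so any reader gets the value back.
safe_u32 = reinterpret(u32, "<q", "<q")

# A uint64 value above the int64 maximum has no wider signed integer type to fall back on.
u64 = 2**63 + 5
wrong_u64 = reinterpret(u64, "<Q", "<q")

print(wrong_u32)  # -1294967296
print(safe_u32)   # 3000000000
print(wrong_u64)  # -9223372036854775803
```

The uint32 case can be widened to int64 without loss, whereas the uint64 case has no lossless signed integer fallback, which is the asymmetry discussed above.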



> pyarrow parquet writer converts uint32 columns to int64
> -------------------------------------------------------
>
>                 Key: ARROW-9215
>                 URL: https://issues.apache.org/jira/browse/ARROW-9215
>             Project: Apache Arrow
>          Issue Type: Bug
>            Reporter: Devavret Makkar
>            Assignee: Uwe Korn
>            Priority: Major
>
> The pyarrow Parquet writer changes uint32 columns to int64. This change is not made for other types: uint8, uint16, and uint64 columns retain their type.
> {code:python}
> In [1]: import pandas as pd
> In [2]: import pyarrow as pa
> In [3]: import pyarrow.parquet as pq
> In [5]: df = pd.DataFrame({'a':pd.Series([1,2,3], dtype='uint32')})
> In [6]: padf = pa.Table.from_pandas(df)
> In [7]: padf
> Out[7]: 
> pyarrow.Table
> a: uint32
> In [8]: pq.write_table(padf, 'pa.parquet')
> In [9]: pq.read_table('pa.parquet')
> Out[9]: 
> pyarrow.Table
> a: int64
> {code}


