
Posted to jira@arrow.apache.org by "Joris Van den Bossche (Jira)" <ji...@apache.org> on 2020/06/15 15:00:00 UTC

[jira] [Commented] (ARROW-9134) [Python] Parquet partitioning degrades Int32 to float64

    [ https://issues.apache.org/jira/browse/ARROW-9134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17135942#comment-17135942 ] 

Joris Van den Bossche commented on ARROW-9134:
----------------------------------------------

This is working correctly on pyarrow master for me:

{code}
In [49]: pd.read_parquet('busted').dtypes
Out[49]: 
a       Int32
b    category
dtype: object
{code}

I suppose this was fixed by ARROW-8251.

Inside {{write_to_dataset}} the data makes a round trip through pandas. Previously we didn't preserve the pandas metadata during that round trip, so an integer column containing nulls was converted to float.

> [Python] Parquet partitioning degrades Int32 to float64
> -------------------------------------------------------
>
>                 Key: ARROW-9134
>                 URL: https://issues.apache.org/jira/browse/ARROW-9134
>             Project: Apache Arrow
>          Issue Type: Bug
>            Reporter: Nicholas Palko
>            Priority: Major
>             Fix For: 1.0.0
>
>
> As you can see below, as soon as I partition the parquet dataset, my {{Int32}} type is read back as {{float64}}. This seems like a bug to me, as partitioning shouldn't change the datatype, and I lose all the advantages of the nullable int.
>  
> {code:python}
> import pandas as pd # 1.0.4
> import pyarrow as pa # 0.17.1
> import pyarrow.parquet as pq
> x = pd.DataFrame({'a':[1, 2, None, 1], 'b':['x']*4})
> x.a = x.a.astype('Int32')
> tbl = pa.Table.from_pandas(x)
> pq.write_to_dataset(tbl, 'ok')
> pq.write_to_dataset(tbl, 'busted', partition_cols=['b'])
> print(pd.read_parquet('ok').dtypes['a'])  # Int32
> print(pd.read_parquet('busted').dtypes['a'])  # float64
> {code}
>  
> (cross-posted on stackoverflow) 
> [https://stackoverflow.com/questions/62356730/parquet-partitioning-degrades-int32-to-float64]
>  


