Posted to jira@arrow.apache.org by "Nicholas Palko (Jira)" <ji...@apache.org> on 2020/06/15 13:02:00 UTC

[jira] [Created] (ARROW-9134) Parquet partitioning degrades Int32 to float64

Nicholas Palko created ARROW-9134:
-------------------------------------

             Summary: Parquet partitioning degrades Int32 to float64
                 Key: ARROW-9134
                 URL: https://issues.apache.org/jira/browse/ARROW-9134
             Project: Apache Arrow
          Issue Type: Bug
            Reporter: Nicholas Palko


As shown below, as soon as I partition the Parquet dataset, my {{Int32}} column is read back as {{float64}}. This looks like a bug to me: partitioning shouldn't change the data type, and I lose all the advantages of the nullable integer type.

 
{code:python}
import pandas as pd # 1.0.4
import pyarrow as pa # 0.17.1
import pyarrow.parquet as pq

x = pd.DataFrame({'a':[1, 2, None, 1], 'b':['x']*4})
x.a = x.a.astype('Int32')
tbl = pa.Table.from_pandas(x)
pq.write_to_dataset(tbl, 'ok')
pq.write_to_dataset(tbl, 'busted', partition_cols=['b'])

print(pd.read_parquet('ok').dtypes['a'])  # Int32
print(pd.read_parquet('busted').dtypes['a'])  # float64
{code}
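
Until this is resolved, one possible stopgap is to cast the column back to the nullable type after reading. This is only a sketch, assuming the values read back as {{float64}} are still whole numbers or NaN. The helper name {{restore_nullable_int}} is hypothetical, and the small frame below stands in for the result of {{pd.read_parquet('busted')}} so the snippet runs without any files on disk.

```python
import pandas as pd


def restore_nullable_int(df, column, dtype="Int32"):
    """Return a copy of df with `column` cast back to a nullable integer dtype.

    Hypothetical helper, not part of pandas or pyarrow. It assumes the
    float values are whole numbers or NaN; NaN becomes <NA> after the cast.
    """
    out = df.copy()
    out[column] = out[column].astype(dtype)
    return out


# Stand-in for the degraded frame returned by pd.read_parquet('busted'):
# column 'a' came back as float64 with NaN in place of the missing value.
degraded = pd.DataFrame({"a": [1.0, 2.0, float("nan"), 1.0], "b": ["x"] * 4})

fixed = restore_nullable_int(degraded, "a")
print(degraded.dtypes["a"])
print(fixed.dtypes["a"])
```

This only repairs the dtype on the pandas side after the fact; it does not explain or fix why the partitioned read loses the type in the first place.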
 

(Cross-posted on Stack Overflow)

[https://stackoverflow.com/questions/62356730/parquet-partitioning-degrades-int32-to-float64]

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)