Posted to issues@arrow.apache.org by "Joris Van den Bossche (Jira)" <ji...@apache.org> on 2020/04/28 14:28:01 UTC

[jira] [Assigned] (ARROW-8251) [Python] pandas.ExtensionDtype does not survive round trip with write_to_dataset

     [ https://issues.apache.org/jira/browse/ARROW-8251?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Joris Van den Bossche reassigned ARROW-8251:
--------------------------------------------

    Assignee: Joris Van den Bossche

> [Python] pandas.ExtensionDtype does not survive round trip with write_to_dataset
> --------------------------------------------------------------------------------
>
>                 Key: ARROW-8251
>                 URL: https://issues.apache.org/jira/browse/ARROW-8251
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 0.16.0
>         Environment: pandas 1.0.1
> pyarrow 0.16
>            Reporter: Ged Steponavicius
>            Assignee: Joris Van den Bossche
>            Priority: Major
>
> write_to_dataset with pandas columns that use a pandas.ExtensionDtype (nullable Int64 or string) produces a parquet dataset which, when read back, has different dtypes than the original DataFrame:
> {code:python}
> import numpy as np
> import pandas as pd
> import pyarrow as pa
> import pyarrow.parquet as pq
>
> parquet_dataset = 'partquet_dataset/'
> parquet_file = 'test.parquet'
>
> df = pd.DataFrame([{'str_col': 'abc', 'int_col': 1, 'part': 1},
>                    {'str_col': np.nan, 'int_col': np.nan, 'part': 1}])
> df['str_col'] = df['str_col'].astype(pd.StringDtype())
> df['int_col'] = df['int_col'].astype(pd.Int64Dtype())
>
> table = pa.Table.from_pandas(df)
> # Write the same table both as a partitioned dataset and as a single file.
> pq.write_to_dataset(table, root_path=parquet_dataset, partition_cols=['part'])
> pq.write_table(table, where=parquet_file)
> {code}
> write_table handles the schema correctly, and the pandas.ExtensionDtype columns survive the round trip:
> {code:python}
> pq.read_table(parquet_file).to_pandas().dtypes
> # str_col    string
> # int_col     Int64
> # part        int64
> {code}
> However, write_to_dataset reverts them to object/float64:
> {code:python}
> pq.read_table(parquet_dataset).to_pandas().dtypes
> # str_col      object
> # int_col     float64
> # part       category
> {code}
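> Since to_pandas reconstructs extension dtypes from the pandas dtype information that pyarrow stores as JSON under the b'pandas' key of the schema metadata, comparing the two schemas may help narrow this down (a hedged sketch added here, not part of the original report):
> {code:python}
> # Check whether the b'pandas' schema metadata survives each write path.
> meta_file = pq.read_table(parquet_file).schema.metadata or {}
> meta_dataset = pq.read_table(parquet_dataset).schema.metadata or {}
> print(b'pandas' in meta_file)     # True, consistent with the working round trip
> print(b'pandas' in meta_dataset)  # presumably False/stale, matching the dtype loss
> {code}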
> I have also tried writing common metadata at the top-level directory of the partitioned dataset and then passing that metadata to read_table, but the results are the same as without it:
> {code:python}
> pq.write_metadata(table.schema, parquet_dataset + '_common_metadata', version='2.0')
> meta = pq.read_metadata(parquet_dataset + '_common_metadata')
> pq.read_table(parquet_dataset, metadata=meta).to_pandas().dtypes
> {code}
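> For what it's worth, the _common_metadata file itself can be checked for the pandas dtype info (a hedged sketch, not from the original report, assuming the file was written as above):
> {code:python}
> import json
>
> # Decode the pandas dtype info from the b'pandas' entry of the file's
> # key/value metadata, if present.
> meta = pq.read_metadata(parquet_dataset + '_common_metadata')
> pandas_meta = json.loads(meta.metadata[b'pandas'])
> print([(col['name'], col['pandas_type']) for col in pandas_meta['columns']])
> {code}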
> This also affects pandas to_parquet when partition_cols is specified:
> {code:python}
> df.to_parquet(path=parquet_dataset, partition_cols=['part'])
> pd.read_parquet(parquet_dataset).dtypes
> # str_col      object
> # int_col     float64
> # part       category
> {code}
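> Until this is fixed, a possible stopgap (a sketch, not part of the original report) is to cast the affected columns back to the intended nullable dtypes after reading, assuming the caller knows them:
> {code:python}
> # Hypothetical workaround: restore the extension dtypes by hand.
> restored = pd.read_parquet(parquet_dataset).astype(
>     {'str_col': 'string', 'int_col': 'Int64'})
> restored.dtypes  # str_col -> string, int_col -> Int64, part stays category
> {code}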
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)