Posted to dev@arrow.apache.org by "Ged Steponavicius (Jira)" <ji...@apache.org> on 2020/03/28 10:20:00 UTC

[jira] [Created] (ARROW-8251) [Python] pandas.ExtensionDtype does not survive round trip with write_to_dataset

Ged Steponavicius created ARROW-8251:
----------------------------------------

             Summary: [Python] pandas.ExtensionDtype does not survive round trip with write_to_dataset
                 Key: ARROW-8251
                 URL: https://issues.apache.org/jira/browse/ARROW-8251
             Project: Apache Arrow
          Issue Type: Bug
          Components: Python
    Affects Versions: 0.16.0
         Environment: pandas 1.0.1
parquet 0.16
            Reporter: Ged Steponavicius


When a DataFrame has columns that use a pandas.ExtensionDtype (nullable int or string), write_to_dataset produces a parquet dataset which, when read back in, has different dtypes than the original df.
{code:java}
import numpy as np
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

parquet_dataset = 'parquet_dataset/'
parquet_file = 'test.parquet'

# Second row is all-missing so both extension columns contain a null
df = pd.DataFrame([{'str_col':'abc','int_col':1,'part':1}, {'str_col':np.nan,'int_col':np.nan,'part':1}])
df['str_col'] = df['str_col'].astype(pd.StringDtype())
df['int_col'] = df['int_col'].astype(pd.Int64Dtype())

table = pa.Table.from_pandas(df)

# Write the same table as a partitioned dataset and as a single file
pq.write_to_dataset(table, root_path=parquet_dataset, partition_cols=['part'])
pq.write_table(table, where=parquet_file)
{code}
write_table handles the schema correctly, and the pandas.ExtensionDtype columns survive the round trip:
{code:java}
pq.read_table(parquet_file).to_pandas().dtypes 
str_col string 
int_col Int64 
part int64 {code}
However, after write_to_dataset the extension dtypes are lost and the columns come back as object/float64:
{code:java}
pq.read_table(parquet_dataset).to_pandas().dtypes 
str_col object 
int_col float64 
part category {code}
I have also tried writing common metadata to the top-level directory of the partitioned dataset and then passing that metadata to read_table, but the results are the same as without metadata:
{code:java}
pq.write_metadata(table.schema, parquet_dataset+'_common_metadata', version='2.0')
meta = pq.read_metadata(parquet_dataset+'_common_metadata')
pq.read_table(parquet_dataset,metadata=meta).to_pandas().dtypes
{code}
This also affects pandas to_parquet when partition_cols is specified:
{code:java}
df.to_parquet(path=parquet_dataset, partition_cols=['part'])
pd.read_parquet(parquet_dataset).dtypes
str_col object
int_col float64
part category
{code}
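A possible stop-gap, not a fix for the underlying issue, is to cast the affected columns back after reading the partitioned dataset. The sketch below is an assumption for illustration only: the helper name restore_dtypes is hypothetical, and the "degraded" frame is built by hand to mimic the object/float64/category dtypes shown above, so it runs without writing any parquet files.

```python
import numpy as np
import pandas as pd


def restore_dtypes(df, dtypes):
    """Hypothetical helper: cast the columns listed in `dtypes` back to the given dtypes."""
    # Only cast columns that are actually present in the frame.
    present = {col: dtype for col, dtype in dtypes.items() if col in df.columns}
    return df.astype(present)


# Hand-built stand-in for the frame returned by reading the partitioned dataset
degraded = pd.DataFrame({
    'str_col': pd.Series(['abc', np.nan], dtype=object),
    'int_col': pd.Series([1.0, np.nan], dtype='float64'),
    'part': pd.Series([1, 1], dtype='category'),
})

# The dtypes of the original df in the report
original_dtypes = {
    'str_col': pd.StringDtype(),
    'int_col': pd.Int64Dtype(),
    'part': 'int64',
}

restored = restore_dtypes(degraded, original_dtypes)
print(restored.dtypes)
```

This requires knowing the intended dtypes out of band (for example, by keeping df.dtypes from before the write), which is exactly the information the round trip is expected to preserve.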
 


