Posted to dev@arrow.apache.org by "Naga (JIRA)" <ji...@apache.org> on 2019/08/02 04:33:00 UTC

[jira] [Created] (ARROW-6114) Datatypes are not preserved when a pandas dataframe partitioned and saved as parquet file using pyarrow

Naga created ARROW-6114:
---------------------------

             Summary: Datatypes are not preserved when a pandas dataframe partitioned and saved as parquet file using pyarrow
                 Key: ARROW-6114
                 URL: https://issues.apache.org/jira/browse/ARROW-6114
             Project: Apache Arrow
          Issue Type: Bug
          Components: Python
    Affects Versions: 0.14.1
         Environment: Python 3.7.3
pyarrow 0.14.1
            Reporter: Naga


h3. Data types are not preserved when a pandas data frame is *partitioned* and saved as a parquet dataset using pyarrow. They are preserved when the data frame is not partitioned.

*Case 1: Saving a partitioned dataset - Data Types are NOT preserved*
{code:java}
# Save a pandas DataFrame locally as a partitioned parquet dataset using pyarrow
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

df = pd.DataFrame({'age': [77, 32, 234], 'name': ['agan', 'bbobby', 'test']})
path = 'test'
partition_cols = ['age']
print('Datatypes before saving the dataset')
print(df.dtypes)
table = pa.Table.from_pandas(df)
pq.write_to_dataset(table, path, partition_cols=partition_cols, preserve_index=False)

# Load the partitioned parquet dataset back from local disk
df = pq.ParquetDataset(path, filesystem=None).read_pandas().to_pandas()
print('\nDatatypes after loading the dataset')
print(df.dtypes)
{code}
*Output:*
{code:java}
Datatypes before saving the dataset
age int64
name object
dtype: object

Datatypes after loading the dataset
name object
age category
dtype: object
{code}
The output above shows that the data type of age is int64 in the original pandas data frame, but it changes to category after the dataset is saved locally and loaded back. The column order also changes: the partition column age now comes after name.
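A possible workaround, sketched here under assumptions and not taken from the report: record the dtypes before writing, then cast the loaded data frame back to them. The helper name restore_dtypes is hypothetical. To keep the snippet runnable without writing any files, the partitioned round trip is simulated with plain pandas by converting age to category by hand and reordering the columns to match the output above.

```python
import pandas as pd


def restore_dtypes(df, dtypes):
    """Hypothetical helper: cast columns back to previously recorded dtypes.

    Columns are also returned in the order they appear in `dtypes`.
    """
    cols = [col for col in dtypes if col in df.columns]
    return df[cols].astype({col: dtypes[col] for col in cols})


original = pd.DataFrame({'age': [77, 32, 234], 'name': ['agan', 'bbobby', 'test']})
# Record the dtypes before the dataset is written
saved_dtypes = original.dtypes.to_dict()

# Assumption: this stands in for the result of the partitioned round trip,
# where the partition column comes back last and as a categorical
loaded = original[['name', 'age']].copy()
loaded['age'] = loaded['age'].astype('category')

restored = restore_dtypes(loaded, saved_dtypes)
print(restored.dtypes)
```

With a real dataset, loaded would be the data frame returned by the read call in Case 1. The cast only succeeds when every partition value can be converted to the recorded dtype.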
*Case 2: Non-partitioned dataset - Data types are preserved*
{code:java}
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

print('Saving a Pandas Dataframe to Local as a parquet file without partitioning using pyarrow')
df = pd.DataFrame({'age': [77, 32, 234], 'name': ['agan', 'bbobby', 'test']})
path = 'test_without_partition'
print('Datatypes before saving the dataset')
print(df.dtypes)
table = pa.Table.from_pandas(df)
pq.write_to_dataset(table, path, preserve_index=False)

# Load the non-partitioned parquet dataset back from local disk
df = pq.ParquetDataset(path, filesystem=None).read_pandas().to_pandas()
print('\nDatatypes after loading the dataset')
print(df.dtypes)
{code}
*Output:*
{code:java}
Saving a Pandas Dataframe to Local as a parquet file without partitioning using pyarrow
Datatypes before saving the dataset
age int64
name object
dtype: object

Datatypes after loading the dataset
age int64
name object
dtype: object
{code}

*Versions*
 * Python 3.7.3
 * pyarrow 0.14.1


