Posted to issues@arrow.apache.org by "Joris Van den Bossche (JIRA)" <ji...@apache.org> on 2019/08/02 09:37:00 UTC
[jira] [Commented] (ARROW-6114) Datatypes are not preserved when a pandas dataframe partitioned and saved as parquet file using pyarrow

    [ https://issues.apache.org/jira/browse/ARROW-6114?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16898737#comment-16898737 ] 

Joris Van den Bossche commented on ARROW-6114:
----------------------------------------------

[~bnriiitb] thanks for opening the issue. 

So when a partitioned dataset is written, the partition columns are not stored in the actual data, but are part of the directory structure (in your case you would have "age=77", "age=32", etc. sub-folders). 
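
To illustrate the layout described above, here is a minimal sketch (not pyarrow's actual code) of how partition keys can be recovered from a file path. The function name `partition_keys` and the file name `data.parquet` are hypothetical. Note that the recovered values are plain strings, which is why the original type is lost:

```python
from pathlib import PurePosixPath


def partition_keys(path):
    """Collect (key, value) pairs from "key=value" directory names in a path."""
    pairs = []
    # Only look at the directories, not the file name itself.
    for part in PurePosixPath(path).parent.parts:
        if "=" in part:
            key, value = part.split("=", 1)
            pairs.append((key, value))
    return pairs


# Hypothetical file inside the "age=77" sub-folder of the dataset root "test".
keys = partition_keys("test/age=77/data.parquet")
print(keys)  # the value is the string "77", not an integer
```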

Currently, we don't save any metadata about the columns used to partition, and since they are also not stored in the actual parquet files (where a schema of the data is stored), we can't get that information from there either.

So when reading a partitioned dataset, (py)arrow does not have much information about the type of the partition column. Currently, the logic is to try to convert the values to ints and otherwise leave them as strings; those values are then converted to a Dictionary type (corresponding to the categorical type in pandas). This logic is here: https://github.com/apache/arrow/blob/06fd2da5e8e71b660e6eea4b7702ca175e31f3f5/python/pyarrow/parquet.py#L585-L609
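
As a rough pure-Python sketch of that logic (an approximation for illustration, not the linked pyarrow implementation; both function names are made up here, and sorting the dictionary is an assumption of this sketch):

```python
def infer_partition_values(raw_values):
    """Try to parse every directory value as an int; otherwise keep strings."""
    try:
        return [int(v) for v in raw_values]
    except ValueError:
        return list(raw_values)


def dictionary_encode(values):
    """Return (dictionary, indices), similar in spirit to a categorical column."""
    dictionary = sorted(set(values))
    positions = {v: i for i, v in enumerate(dictionary)}
    return dictionary, [positions[v] for v in values]


# Values as they would appear in the "age=..." directory names.
raw = ["77", "32", "234"]
dictionary, indices = dictionary_encode(infer_partition_values(raw))
print(dictionary, indices)
```

In this sketch, a single non-integer value makes the whole column fall back to strings. Also note that Python's `int()` accepts underscores as digit separators, so a value like "1_000" would parse as 1000 here.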

There is currently no option to change this. So right now, the workaround is to convert the categorical back to an integer column in pandas.  
Longer term, we should maybe think about storing the type of the partition keys as metadata, and adding an option to restore them as a dictionary column or not.
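
A minimal sketch of that workaround, assuming the partition column holds only integer values. The DataFrame below is a stand-in for the result of reading the partitioned dataset, built by hand so the example runs without any files on disk:

```python
import pandas as pd

# Stand-in for the frame returned after reading the partitioned dataset:
# the partition column "age" comes back as a categorical.
df = pd.DataFrame({
    "name": ["agan", "bbobby", "test"],
    "age": pd.Categorical([77, 32, 234]),
})

# Convert the categorical partition column back to a plain integer column.
df["age"] = df["age"].astype("int64")
print(df.dtypes)
```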

Related issues about the type of the partition column: ARROW-3388 (booleans as strings), ARROW-5666 (strings with underscores interpreted as int)

> Datatypes are not preserved when a pandas dataframe partitioned and saved as parquet file using pyarrow
> -------------------------------------------------------------------------------------------------------
>
>                 Key: ARROW-6114
>                 URL: https://issues.apache.org/jira/browse/ARROW-6114
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 0.14.1
>         Environment: Python 3.7.3
> pyarrow 0.14.1
>            Reporter: Naga
>            Priority: Major
>              Labels: parquet
>
> h3. Datatypes are not preserved when a pandas data frame is *partitioned* and saved as parquet file using pyarrow but that's not the case when the data frame is not partitioned.
> *Case 1: Saving a partitioned dataset - Data Types are NOT preserved*
> {code:java}
> # Saving a pandas DataFrame locally as a partitioned parquet dataset using pyarrow
> import pandas as pd
> import pyarrow as pa
> import pyarrow.parquet as pq
> df = pd.DataFrame({'age': [77, 32, 234], 'name': ['agan', 'bbobby', 'test']})
> path = 'test'
> partition_cols=['age']
> print('Datatypes before saving the dataset')
> print(df.dtypes)
> table = pa.Table.from_pandas(df)
> pq.write_to_dataset(table, path, partition_cols=partition_cols, preserve_index=False)
> # Loading the partitioned parquet dataset from local disk
> df = pq.ParquetDataset(path, filesystem=None).read_pandas().to_pandas()
> print('\nDatatypes after loading the dataset')
> print(df.dtypes)
> {code}
> *Output:*
> {code:java}
> Datatypes before saving the dataset
> age int64
> name object
> dtype: object
> Datatypes after loading the dataset
> name object
> age category
> dtype: object
> {code}
> h5. {color:#d04437}From the above output, we can see that the data type of age is int64 in the original pandas data frame, but it changed to category after the dataset was saved locally and loaded back.{color}
> *Case 2: Non-partitioned dataset - Data types are preserved*
> {code:java}
> import pandas as pd
> import pyarrow as pa
> import pyarrow.parquet as pq
> print('Saving a Pandas Dataframe to Local as a parquet file without partitioning using pyarrow')
> df = pd.DataFrame({'age': [77, 32, 234], 'name': ['agan', 'bbobby', 'test']})
> path = 'test_without_partition'
> print('Datatypes before saving the dataset')
> print(df.dtypes)
> table = pa.Table.from_pandas(df)
> pq.write_to_dataset(table, path, preserve_index=False)
> # Loading the non-partitioned parquet dataset from local disk
> df = pq.ParquetDataset(path, filesystem=None).read_pandas().to_pandas()
> print('\nDatatypes after loading the dataset')
> print(df.dtypes)
> {code}
> *Output:*
> {code:java}
> Saving a Pandas Dataframe to Local as a parquet file without partitioning using pyarrow
> Datatypes before saving the dataset
> age int64
> name object
> dtype: object
> Datatypes after loading the dataset
> age int64
> name object
> dtype: object
> {code}
> *Versions*
>  * Python 3.7.3
>  * pyarrow 0.14.1



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)