Posted to jira@arrow.apache.org by "Joao Moreira (Jira)" <ji...@apache.org> on 2021/07/14 18:33:00 UTC

[jira] [Created] (ARROW-13342) Categorical boolean column saved as regular boolean in parquet

Joao Moreira created ARROW-13342:
------------------------------------

             Summary: Categorical boolean column saved as regular boolean in parquet
                 Key: ARROW-13342
                 URL: https://issues.apache.org/jira/browse/ARROW-13342
             Project: Apache Arrow
          Issue Type: Bug
          Components: Parquet, Python
    Affects Versions: 4.0.1
            Reporter: Joao Moreira


When saving a pandas DataFrame to Parquet, a categorical column whose categories are boolean is saved as a regular boolean column.

This is a problem because, when reading the Parquet file back, I expect the column to still be categorical.

 
Reproducible example:
{code:python}
import pandas as pd
import pyarrow
import pyarrow.parquet  # the parquet submodule is not loaded by "import pyarrow" alone

# Create dataframe with boolean column that is then converted to categorical
df = pd.DataFrame({'a': [True, True, False, True, False]})
df['a'] = df['a'].astype('category')

# Convert to arrow Table and save to disk
table = pyarrow.Table.from_pandas(df)
pyarrow.parquet.write_table(table, 'test.parquet')

# Reload data and convert back to pandas
table_rel = pyarrow.parquet.read_table('test.parquet')
df_rel = table_rel.to_pandas()
{code}

The Arrow {{table}} correctly represents the column as an Arrow {{DICTIONARY}} type:
{noformat}
>>> df['a']
0     True
1     True
2    False
3     True
4    False
Name: a, dtype: category
Categories (2, object): [False, True]
>>>
>>> table
pyarrow.Table
a: dictionary<values=bool, indices=int8, ordered=0>
{noformat}

However, the reloaded column is now a regular boolean:
{noformat}
>>> table_rel
pyarrow.Table
a: bool
>>>
>>> df_rel['a']
0     True
1     True
2    False
3     True
4    False
Name: a, dtype: bool
{noformat}

I would have expected the column to be read back as categorical.
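
As a possible stopgap (a sketch only, not a fix for the underlying issue), the categorical dtype can be re-applied on the pandas side after loading. The helper name {{restore_categoricals}} is hypothetical and not part of pandas or pyarrow, and the caller is assumed to know which columns were categorical before the write. The frame below stands in for the reloaded {{df_rel}}:

```python
import pandas as pd


def restore_categoricals(df, columns):
    """Return a copy of df with the given columns cast back to 'category'.

    Hypothetical helper for illustration; not part of pandas or pyarrow.
    """
    out = df.copy()
    for col in columns:
        out[col] = out[col].astype('category')
    return out


# Stand-in for the reloaded frame, whose column came back as plain bool
df_rel = pd.DataFrame({'a': [True, True, False, True, False]})

# Re-apply the categorical dtype to the affected column
df_fixed = restore_categoricals(df_rel, ['a'])
```

This does not restore any custom category order or unused categories, since that information is not recovered from the file in this sketch.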




--
This message was sent by Atlassian Jira
(v8.3.4#803005)