Posted to dev@arrow.apache.org by "Jonas Nelle (Jira)" <ji...@apache.org> on 2020/05/15 08:50:00 UTC

[jira] [Created] (ARROW-8812) Columns of type CategoricalIndex fails to be read back

Jonas Nelle created ARROW-8812:
----------------------------------

             Summary: Columns of type CategoricalIndex fails to be read back
                 Key: ARROW-8812
                 URL: https://issues.apache.org/jira/browse/ARROW-8812
             Project: Apache Arrow
          Issue Type: Bug
          Components: Python
    Affects Versions: 0.15.1
         Environment: Python 3.7.7
MacOS (Darwin-19.4.0-x86_64-i386-64bit)
Pandas 1.0.3
Pyarrow 0.15.1
            Reporter: Jonas Nelle


When a data frame's columns are a {{CategoricalIndex}}, writing the table to Parquet and reading it back raises {{TypeError: data type "categorical" not understood}}:
{code:python}
import pandas as pd
from pyarrow import parquet, Table

base_df = pd.DataFrame([['foo', 'j', "1"],
                        ['bar', 'j', "1"],
                        ['foo', 'j', "1"],
                        ['foobar', 'j', "1"]],
                       columns=['my_cat', 'var', 'for_count'])

base_df['my_cat'] = base_df['my_cat'].astype('category')

df = (
    base_df
    .groupby(["my_cat", "var"], observed=True)
    .agg({"for_count": "count"})
    .rename(columns={"for_count": "my_cat_counts"})
    .unstack(level="my_cat", fill_value=0)
)

print(df)
{code}
The resulting data frame looks something like this:
|| ||my_cat_counts|| || ||
|my_cat|foo|bar|foobar|
|var| | | |
|j|2|1|1|

Writing this data frame to Parquet and reading it back then raises the {{TypeError}}:
{code:python}
parquet.write_table(Table.from_pandas(df), "test.pqt")
parquet.read_table("test.pqt").to_pandas()
# raises TypeError: data type "categorical" not understood
{code}
In the example, the columns are also a {{MultiIndex}}, but that isn't the cause. The error persists after flattening them to a single level:
{code:python}
df.columns = df.columns.get_level_values(1)
parquet.write_table(Table.from_pandas(df), "test.pqt")
parquet.read_table("test.pqt").to_pandas()
# raises TypeError: data type "categorical" not understood
{code}
This is the workaround [suggested on stackoverflow|https://stackoverflow.com/questions/55749399/how-to-fix-the-issue-of-categoricalindex-column-in-pandas]:
{code:python}
df.columns = pd.Index(list(df.columns)) # suggested fix for the time being
parquet.write_table(Table.from_pandas(df), "test.pqt")
parquet.read_table("test.pqt").to_pandas() # no error
{code}
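The same idea can be wrapped in a small helper so it also covers the {{MultiIndex}} case from the first example. This is only a sketch: {{decategorize_columns}} is a hypothetical name, not part of pandas or pyarrow, and it assumes that rebuilding the column labels from their plain values is acceptable (the categorical dtype of the labels is lost).

```python
import pandas as pd


def decategorize_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df whose column labels carry no categorical dtype.

    Hypothetical helper for illustration; not a pandas or pyarrow API.
    """
    out = df.copy()
    cols = out.columns
    if isinstance(cols, pd.MultiIndex):
        # Rebuild every level from its plain values so that no level
        # remains a CategoricalIndex.
        out.columns = pd.MultiIndex.from_arrays(
            [list(cols.get_level_values(i)) for i in range(cols.nlevels)],
            names=cols.names,
        )
    elif isinstance(cols, pd.CategoricalIndex):
        # Same approach as the one-line workaround, keeping the index name.
        out.columns = pd.Index(list(cols), name=cols.name)
    return out


# Usage: columns are a CategoricalIndex, as after the unstack in the report.
example = pd.DataFrame(
    [[2, 1, 1]],
    columns=pd.CategoricalIndex(["foo", "bar", "foobar"], name="my_cat"),
)
fixed = decategorize_columns(example)
print(type(fixed.columns).__name__)
```

The result of {{decategorize_columns(df)}} can then be passed to {{Table.from_pandas}} in place of {{df}}; the original data frame is left unchanged.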
Are there any plans to support the pattern described here in the future?


