Posted to dev@arrow.apache.org by "Jonas Nelle (Jira)" <ji...@apache.org> on 2020/05/15 08:50:00 UTC
[jira] [Created] (ARROW-8812) Columns of type CategoricalIndex fails to be read back
Jonas Nelle created ARROW-8812:
----------------------------------
Summary: Columns of type CategoricalIndex fails to be read back
Key: ARROW-8812
URL: https://issues.apache.org/jira/browse/ARROW-8812
Project: Apache Arrow
Issue Type: Bug
Components: Python
Affects Versions: 0.15.1
Environment: Python 3.7.7
MacOS (Darwin-19.4.0-x86_64-i386-64bit)
Pandas 1.0.3
Pyarrow 0.15.1
Reporter: Jonas Nelle
When columns are of type {{CategoricalIndex}}, saving and reading the table back causes a {{TypeError: data type "categorical" not understood}}:
{code:python}
import pandas as pd
from pyarrow import parquet, Table
base_df = pd.DataFrame([['foo', 'j', "1"],
                        ['bar', 'j', "1"],
                        ['foo', 'j', "1"],
                        ['foobar', 'j', "1"]],
                       columns=['my_cat', 'var', 'for_count'])
base_df['my_cat'] = base_df['my_cat'].astype('category')
df = (
    base_df
    .groupby(["my_cat", "var"], observed=True)
    .agg({"for_count": "count"})
    .rename(columns={"for_count": "my_cat_counts"})
    .unstack(level="my_cat", fill_value=0)
)
print(df)
{code}
The resulting data frame looks something like this:
|| ||my_cat_counts|| || ||
|my_cat|foo|bar|foobar|
|var| | | |
|j|2|1|1|
Writing the table and reading it back then raises the {{TypeError}}:
{code:python}
parquet.write_table(Table.from_pandas(df), "test.pqt")
parquet.read_table("test.pqt").to_pandas()
# raises TypeError: data type "categorical" not understood
{code}
In this example the columns also form a {{MultiIndex}}, but that is not the cause. The error persists after reducing the columns to the single {{my_cat}} level:
{code:python}
df.columns = df.columns.get_level_values(1)
parquet.write_table(Table.from_pandas(df), "test.pqt")
parquet.read_table("test.pqt").to_pandas()
# raises TypeError: data type "categorical" not understood
{code}
This is the workaround [suggested on stackoverflow|https://stackoverflow.com/questions/55749399/how-to-fix-the-issue-of-categoricalindex-column-in-pandas]:
{code:python}
df.columns = pd.Index(list(df.columns)) # suggested fix for the time being
parquet.write_table(Table.from_pandas(df), "test.pqt")
parquet.read_table("test.pqt").to_pandas() # no error
{code}
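The same workaround can be wrapped in a small helper so that it also covers the {{MultiIndex}} case from the first example. This is only a sketch: the function name {{flatten_categorical_columns}} is hypothetical and not part of pandas or pyarrow, and it assumes that rebuilding the column labels from plain Python values is acceptable, which drops the categorical dtype of the labels.

```python
import pandas as pd


def flatten_categorical_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df whose column labels are no longer categorical.

    Hypothetical helper (not a pandas or pyarrow API): it rebuilds the
    column index from plain Python values, as in the workaround above.
    """
    out = df.copy()
    cols = out.columns
    if isinstance(cols, pd.MultiIndex):
        # Rebuild from tuples so that no level remains a CategoricalIndex.
        out.columns = pd.MultiIndex.from_tuples(list(cols), names=cols.names)
    elif isinstance(cols, pd.CategoricalIndex):
        # Same idea as `pd.Index(list(df.columns))`, keeping the index name.
        out.columns = pd.Index(list(cols), name=cols.name)
    return out
```

Calling {{Table.from_pandas(flatten_categorical_columns(df))}} in place of {{Table.from_pandas(df)}} would then apply the workaround before writing, without modifying the original data frame.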
Are there any plans to support the pattern described here in the future?