Posted to issues@arrow.apache.org by "Wes McKinney (Jira)" <ji...@apache.org> on 2020/05/15 15:09:00 UTC
[jira] [Updated] (ARROW-8812) [Python] Columns of type CategoricalIndex fails to be read back
[ https://issues.apache.org/jira/browse/ARROW-8812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Wes McKinney updated ARROW-8812:
--------------------------------
Summary: [Python] Columns of type CategoricalIndex fails to be read back (was: Columns of type CategoricalIndex fails to be read back)
> [Python] Columns of type CategoricalIndex fails to be read back
> ---------------------------------------------------------------
>
> Key: ARROW-8812
> URL: https://issues.apache.org/jira/browse/ARROW-8812
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Affects Versions: 0.15.1
> Environment: Python 3.7.7
> MacOS (Darwin-19.4.0-x86_64-i386-64bit)
> Pandas 1.0.3
> Pyarrow 0.15.1
> Reporter: Jonas Nelle
> Priority: Minor
> Labels: parquet
>
> When columns are of type {{CategoricalIndex}}, saving and reading the table back causes a {{TypeError: data type "categorical" not understood}}:
> {code:python}
> import pandas as pd
> from pyarrow import parquet, Table
> base_df = pd.DataFrame([['foo', 'j', "1"],
>                         ['bar', 'j', "1"],
>                         ['foo', 'j', "1"],
>                         ['foobar', 'j', "1"]],
>                        columns=['my_cat', 'var', 'for_count'])
> base_df['my_cat'] = base_df['my_cat'].astype('category')
> df = (
>     base_df
>     .groupby(["my_cat", "var"], observed=True)
>     .agg({"for_count": "count"})
>     .rename(columns={"for_count": "my_cat_counts"})
>     .unstack(level="my_cat", fill_value=0)
> )
> print(df)
> {code}
> The resulting data frame looks something like this:
> || ||my_cat_counts|| || ||
> |my_cat|foo|bar|foobar|
> |var| | | |
> |j|2|1|1|
> Then, writing the table and reading it back causes the {{TypeError}}:
> {code:python}
> parquet.write_table(Table.from_pandas(df), "test.pqt")
> parquet.read_table("test.pqt").to_pandas()
> # raises TypeError: data type "categorical" not understood
> {code}
> In the example, the columns also form a MultiIndex, but that isn't the problem; the error persists after flattening to a single level:
> {code:python}
> df.columns = df.columns.get_level_values(1)
> parquet.write_table(Table.from_pandas(df), "test.pqt")
> parquet.read_table("test.pqt").to_pandas()
> # raises TypeError: data type "categorical" not understood
> {code}
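The report above suggests that the failure depends only on the column labels being a CategoricalIndex. Assuming that is right, a smaller frame with the same kind of column index can be built directly, without the groupby/unstack steps. This is only a sketch for narrowing down the reproduction; the values mirror the table in the report, and whether it triggers the same error on a given pyarrow version has not been verified here.

```python
import pandas as pd

# Build the column labels as a CategoricalIndex directly (values taken from the report's table).
df = pd.DataFrame(
    [[2, 1, 1]],
    index=pd.Index(["j"], name="var"),
    columns=pd.CategoricalIndex(["foo", "bar", "foobar"], name="my_cat"),
)

# Confirm what kind of index the columns are before writing anything.
columns_are_categorical = isinstance(df.columns, pd.CategoricalIndex)
print(type(df.columns).__name__, df.columns.dtype)
```

The same isinstance check can be run on the frame from the original example to confirm that its columns are categorical after the get_level_values(1) step.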
> This is the workaround [suggested on Stack Overflow|https://stackoverflow.com/questions/55749399/how-to-fix-the-issue-of-categoricalindex-column-in-pandas]:
> {code:python}
> df.columns = pd.Index(list(df.columns)) # suggested fix for the time being
> parquet.write_table(Table.from_pandas(df), "test.pqt")
> parquet.read_table("test.pqt").to_pandas() # no error
> {code}
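The workaround above replaces the column labels in place and drops the index name. A slightly more general version could look like the following sketch. The helper name with_plain_columns is hypothetical (it is not part of pandas or pyarrow), and it assumes that rebuilding the labels from plain Python values is enough to avoid the categorical dtype, both for a flat index and for MultiIndex levels.

```python
import pandas as pd


def with_plain_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df whose column labels are not categorical.

    Hypothetical helper, not a pandas or pyarrow API.
    """
    out = df.copy()
    cols = out.columns
    if isinstance(cols, pd.MultiIndex):
        # Rebuilding from tuples re-infers each level from the plain values.
        out.columns = pd.MultiIndex.from_tuples(list(cols), names=cols.names)
    elif isinstance(cols, pd.CategoricalIndex):
        # Same idea as the workaround above, but the index name is kept.
        out.columns = pd.Index(list(cols), name=cols.name)
    return out


# Flat CategoricalIndex columns, as after get_level_values(1) in the report.
flat = pd.DataFrame(
    [[2, 1, 1]],
    columns=pd.CategoricalIndex(["foo", "bar", "foobar"], name="my_cat"),
)
flat_fixed = with_plain_columns(flat)

# MultiIndex columns with a categorical second level, as after unstack in the report.
multi = pd.DataFrame(
    [[2, 1, 1]],
    columns=pd.MultiIndex.from_arrays(
        [["my_cat_counts"] * 3, pd.Categorical(["foo", "bar", "foobar"])],
        names=[None, "my_cat"],
    ),
)
multi_fixed = with_plain_columns(multi)
```

The returned frame can then be passed to Table.from_pandas as in the workaround above. The sketch leaves the input frame untouched, and it has not been tested against pyarrow 0.15.1.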
> Are there any plans to support the pattern described here in the future?
--
This message was sent by Atlassian Jira
(v8.3.4#803005)