You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@arrow.apache.org by "Wes McKinney (Jira)" <ji...@apache.org> on 2020/05/15 15:09:00 UTC
[jira] [Updated] (ARROW-8812) [Python] Columns of type CategoricalIndex fails to be read back

     [ https://issues.apache.org/jira/browse/ARROW-8812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wes McKinney updated ARROW-8812:
--------------------------------
    Summary: [Python] Columns of type CategoricalIndex fails to be read back  (was: Columns of type CategoricalIndex fails to be read back)

> [Python] Columns of type CategoricalIndex fails to be read back
> ---------------------------------------------------------------
>
>                 Key: ARROW-8812
>                 URL: https://issues.apache.org/jira/browse/ARROW-8812
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 0.15.1
>         Environment: Python 3.7.7
> MacOS (Darwin-19.4.0-x86_64-i386-64bit)
> Pandas 1.0.3
> Pyarrow 0.15.1
>            Reporter: Jonas Nelle
>            Priority: Minor
>              Labels: parquet
>
> When columns are of type {{CategoricalIndex}}, saving and reading the table back causes a {{TypeError: data type "categorical" not understood}}:
> {code:python}
> import pandas as pd
> from pyarrow import parquet, Table
> base_df = pd.DataFrame([['foo', 'j', "1"],
>                         ['bar', 'j', "1"],
>                         ['foo', 'j', "1"],
>                         ['foobar', 'j', "1"]],
>                        columns=['my_cat', 'var', 'for_count'])
> base_df['my_cat'] = base_df['my_cat'].astype('category')
> df = (
>     base_df
>     .groupby(["my_cat", "var"], observed=True)
>     .agg({"for_count": "count"})
>     .rename(columns={"for_count": "my_cat_counts"})
>     .unstack(level="my_cat", fill_value=0)
> )
> print(df)
> {code}
> The resulting data frame looks something like this:
> || ||my_cat_counts|| || ||
> |my_cat|foo|bar|foobar|
> |var| | | |
> |j|2|1|1|
> Then, writing and reading causes the {{KeyError}}:
> {code:python}
> parquet.write_table(Table.from_pandas(df), "test.pqt")
> parquet.read_table("test.pqt").to_pandas()
> > TypeError: data type "categorical" not understood
> {code}
> In the example, the column is also a MultiIndex, but that isn't the problem:
> {code:python}
> df.columns = df.columns.get_level_values(1)
> parquet.write_table(Table.from_pandas(df), "test.pqt")
> parquet.read_table("test.pqt").to_pandas()
> > TypeError: data type "categorical" not understood
> {code}
> This is the workaround [suggested on stackoverflow|https://stackoverflow.com/questions/55749399/how-to-fix-the-issue-of-categoricalindex-column-in-pandas]:
> {code:python}
> df.columns = pd.Index(list(df.columns)) # suggested fix for the time being
> parquet.write_table(Table.from_pandas(df), "test.pqt")
> parquet.read_table("test.pqt").to_pandas() # no error
> {code}
> Are there any plans to support the pattern described here in the future?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)