Posted to issues@arrow.apache.org by "Joris Van den Bossche (Jira)" <ji...@apache.org> on 2020/05/20 14:51:00 UTC
[jira] [Commented] (ARROW-8812) [Python] Column names of type CategoricalIndex fails to convert back to pandas

    [ https://issues.apache.org/jira/browse/ARROW-8812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17112307#comment-17112307 ] 

Joris Van den Bossche commented on ARROW-8812:
----------------------------------------------

[~jonas-nelle] thanks for the report!

I don't think it will be possible to fully support roundtripping Categorical column names. In Arrow, column names are just strings and carry no type of their own. We store the type of the original pandas columns in the metadata and use it to attempt to restore the original column names, which works for basic types. For a Categorical, however, you would in principle also need to know the exact categories, and those are not saved in the metadata.
(Note that this differs for row indexes: those are stored as actual columns in the pyarrow Table and thus have a proper type.)

That said, pyarrow should certainly be able to fall back to the plain values, instead of raising an error. That's probably the better behaviour.
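
A rough sketch of what such a fallback could look like, written with pandas and NumPy only. The helper name `restore_column_labels` and its arguments are hypothetical and this is not pyarrow's actual implementation; it only illustrates the idea of trying the recorded type and keeping the plain values when that type cannot be interpreted:

```python
import numpy as np
import pandas as pd


def restore_column_labels(labels, recorded_dtype):
    """Hypothetical helper: cast string labels back to the recorded dtype,
    falling back to the plain labels if the dtype cannot be applied."""
    index = pd.Index(labels)
    try:
        # np.dtype("categorical") raises TypeError, like the error in this issue.
        return index.astype(np.dtype(recorded_dtype))
    except (TypeError, ValueError):
        # Fall back to the plain values instead of raising.
        return index


# A basic type is restored from the recorded name.
print(restore_column_labels(["0", "1"], "int64"))

# An unrecoverable type falls back to the plain string labels.
print(restore_column_labels(["foo", "bar", "foobar"], "categorical"))
```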




> [Python] Column names of type CategoricalIndex fails to convert back to pandas
> ------------------------------------------------------------------------------
>
>                 Key: ARROW-8812
>                 URL: https://issues.apache.org/jira/browse/ARROW-8812
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 0.15.1
>         Environment: Python 3.7.7
> MacOS (Darwin-19.4.0-x86_64-i386-64bit)
> Pandas 1.0.3
> Pyarrow 0.15.1
>            Reporter: Jonas Nelle
>            Priority: Minor
>              Labels: pandas, parquet
>
> When columns are of type {{CategoricalIndex}}, saving and reading the table back causes a {{TypeError: data type "categorical" not understood}}:
> {code:python}
> import pandas as pd
> from pyarrow import parquet, Table
> base_df = pd.DataFrame([['foo', 'j', "1"],
>                         ['bar', 'j', "1"],
>                         ['foo', 'j', "1"],
>                         ['foobar', 'j', "1"]],
>                        columns=['my_cat', 'var', 'for_count'])
> base_df['my_cat'] = base_df['my_cat'].astype('category')
> df = (
>     base_df
>     .groupby(["my_cat", "var"], observed=True)
>     .agg({"for_count": "count"})
>     .rename(columns={"for_count": "my_cat_counts"})
>     .unstack(level="my_cat", fill_value=0)
> )
> print(df)
> {code}
> The resulting data frame looks something like this:
> || ||my_cat_counts|| || ||
> |my_cat|foo|bar|foobar|
> |var| | | |
> |j|2|1|1|
> Then, writing and reading causes the {{TypeError}}:
> {code:python}
> parquet.write_table(Table.from_pandas(df), "test.pqt")
> parquet.read_table("test.pqt").to_pandas()
> > TypeError: data type "categorical" not understood
> {code}
> In the example, the column index is also a MultiIndex, but that isn't the problem:
> {code:python}
> df.columns = df.columns.get_level_values(1)
> parquet.write_table(Table.from_pandas(df), "test.pqt")
> parquet.read_table("test.pqt").to_pandas()
> > TypeError: data type "categorical" not understood
> {code}
> This is the workaround [suggested on stackoverflow|https://stackoverflow.com/questions/55749399/how-to-fix-the-issue-of-categoricalindex-column-in-pandas]:
> {code:python}
> df.columns = pd.Index(list(df.columns)) # suggested fix for the time being
> parquet.write_table(Table.from_pandas(df), "test.pqt")
> parquet.read_table("test.pqt").to_pandas() # no error
> {code}
> Are there any plans to support the pattern described here in the future?


