Posted to jira@arrow.apache.org by "Weston Pace (Jira)" <ji...@apache.org> on 2022/09/27 15:45:00 UTC

[jira] [Updated] (ARROW-17852) [python] `dtype` of `Categorical` category columns are not preserved

     [ https://issues.apache.org/jira/browse/ARROW-17852?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Weston Pace updated ARROW-17852:
--------------------------------
    Summary: [python] `dtype` of `Categorical` category columns are not preserved  (was: `dtype` of `Categorical` category columns are not preserved)

> [python] `dtype` of `Categorical` category columns are not preserved
> --------------------------------------------------------------------
>
>                 Key: ARROW-17852
>                 URL: https://issues.apache.org/jira/browse/ARROW-17852
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 9.0.0
>            Reporter: Ryan Ballard
>            Priority: Major
>              Labels: categorical, pandas, pyarrow
>
> Hi there,
> First time submitting an issue here so apologies if there's anything I've missed.
> I see the bug below, whereby the {{dtype}} of the categories themselves (within a {{pd.Categorical}}) is not preserved on a round trip via pyarrow. Hopefully the snippet below demonstrates the issue.
> This causes a problem because the dtypes need to be the same in order for the categories to be considered the same (so they can then be concatenated, for example).
> The current workaround is to store the column as a plain {{pd.StringDtype()}} and then convert it to {{pd.Categorical}} in memory with pandas (which infers the category dtype from the underlying type, but in doing so sacrifices the disk savings of storing the column as a dictionary).
> Using pyarrow 9.0.0 and pandas 1.4.4.
> Thanks
>  
> {{import pandas as pd}}
> {{import pyarrow as pa}}
>  
> {{# note, Categorical column B is constructed from `pd.StringDtype`}}
> {{df = pd.DataFrame({"A": ["a", "b", "c", "a"]}, dtype=pd.StringDtype())}}
> {{df["B"] = df["A"].astype("category")}}
> {{print(df["B"].cat.categories)}}
> {{# Index(['a', 'b', 'c'], dtype='string')}}
>  
> {{# however, this is downcast to `object` during a roundtrip}}
> {{print(pa.Table.from_pandas(df).to_pandas()["B"].cat.categories)}}
> {{# Index(['a', 'b', 'c'], dtype='object')}}
>  
>  