Posted to issues@arrow.apache.org by "Ryan Ballard (Jira)" <ji...@apache.org> on 2022/09/27 09:48:00 UTC

[jira] [Created] (ARROW-17852) `dtype` of `Categorical` category columns are not preserved

Ryan Ballard created ARROW-17852:
------------------------------------

             Summary: `dtype` of `Categorical` category columns are not preserved
                 Key: ARROW-17852
                 URL: https://issues.apache.org/jira/browse/ARROW-17852
             Project: Apache Arrow
          Issue Type: Bug
          Components: Python
    Affects Versions: 9.0.0
            Reporter: Ryan Ballard


Hi there,

First time submitting an issue here so apologies if there's anything I've missed.

I see the bug below, whereby the {{dtype}} of the categories themselves (within a {{pd.Categorical}}) is not preserved on a round trip via pyarrow. Hopefully the snippet below demonstrates the issue.

The current workaround is to store the column as a plain {{pd.StringDtype()}} and then convert it to {{pd.Categorical}} in memory with pandas (which infers the category dtype from the underlying type, but in doing so sacrifices the disk savings of storing the column as a dictionary).

Using pyarrow 9.0.0 and pandas 1.4.4.

Thanks
 

{{import pandas as pd}}

{{import pyarrow as pa}}

 

{{# note, Categorical column B is constructed from `pd.StringDtype`}}

{{df = pd.DataFrame({"A": ["a", "b", "c", "a"]}, dtype=pd.StringDtype())}}

{{df["B"] = df["A"].astype("category")}}

{{print(df["B"].cat.categories)}}
{{# Index(['a', 'b', 'c'], dtype='string')}}

 

{{# however, this is downcast to `object` during a roundtrip}}

{{print(pa.Table.from_pandas(df).to_pandas()["B"].cat.categories)}}

{{# Index(['a', 'b', 'c'], dtype='object')}}
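
A possible alternative that keeps the dictionary encoding is to re-cast the categories after the round trip. This is only a sketch: {{restore_category_dtype}} is a hypothetical helper, not a pandas or pyarrow API, and the object-dtype categorical below merely simulates what comes back from the round trip.

```python
import pandas as pd


def restore_category_dtype(series: pd.Series, dtype="string") -> pd.Series:
    """Return a categorical Series whose categories are cast to `dtype`.

    Hypothetical helper for illustration; not part of pandas or pyarrow.
    """
    # Cast only the categories Index; the underlying codes are unchanged.
    new_categories = series.cat.categories.astype(dtype)
    return series.cat.rename_categories(new_categories)


# Simulate the state after the round trip: categories held as `object`.
after_roundtrip = pd.Series(
    pd.Categorical(
        ["a", "b", "c", "a"],
        categories=pd.Index(["a", "b", "c"], dtype=object),
    )
)

restored = restore_category_dtype(after_roundtrip)
print(restored.cat.categories.dtype)
```

This needs the caller to know the intended category dtype up front, since that information is what gets lost in the round trip.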

 

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)