Posted to issues@arrow.apache.org by "Ryan Ballard (Jira)" <ji...@apache.org> on 2022/09/27 09:48:00 UTC
[jira] [Created] (ARROW-17852) `dtype` of `Categorical` category columns are not preserved
Ryan Ballard created ARROW-17852:
------------------------------------
Summary: `dtype` of `Categorical` category columns are not preserved
Key: ARROW-17852
URL: https://issues.apache.org/jira/browse/ARROW-17852
Project: Apache Arrow
Issue Type: Bug
Components: Python
Affects Versions: 9.0.0
Reporter: Ryan Ballard
Hi there,
First time submitting an issue here so apologies if there's anything I've missed.
I see the bug below, whereby the {{dtype}} of the categories themselves (within a {{pd.Categorical}}) is not preserved on a round trip via pyarrow. Hopefully the snippet below demonstrates the issue.
The current workaround is to store the column as a plain {{pd.StringDtype()}} and then convert it to {{pd.Categorical}} in memory with pandas (which infers the category dtype from the underlying type, but in doing so sacrifices the disk savings of storing the column as a dictionary).
Using pyarrow 9.0.0 and pandas 1.4.4.
Thanks
{{import pandas as pd}}
{{import pyarrow as pa}}
{{# note, Categorical column B is constructed from `pd.StringDtype`}}
{{df = pd.DataFrame({"A": ["a", "b", "c", "a"]}, dtype=pd.StringDtype())}}
{{df["B"] = df["A"].astype("category")}}
{{print(df["B"].cat.categories)}}
{{# Index(['a', 'b', 'c'], dtype='string')}}
{{# however, this is downcast to `object` during a roundtrip}}
{{print(pa.Table.from_pandas(df).to_pandas()["B"].cat.categories)}}
{{# Index(['a', 'b', 'c'], dtype='object')}}