Posted to jira@arrow.apache.org by "Ryan Ballard (Jira)" <ji...@apache.org> on 2022/09/27 09:49:00 UTC
[jira] [Updated] (ARROW-17852) `dtype` of `Categorical` category columns are not preserved
[ https://issues.apache.org/jira/browse/ARROW-17852?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Ryan Ballard updated ARROW-17852:
---------------------------------
Description:
Hi there,
First time submitting an issue here so apologies if there's anything I've missed.
I see the bug below, whereby the {{dtype}} of the categories themselves (within a {{pd.Categorical}}) is not preserved on a round trip via pyarrow. Hopefully the snippet below demonstrates the issue.
The current workaround is to store the column as a plain {{pd.StringDtype()}} and then convert it to {{pd.Categorical}} in memory with pandas (which infers the categories' dtype from the underlying type, but in doing so sacrifices the disk savings of storing the column as a dictionary).
Using pyarrow 9.0.0 and pandas 1.4.4.
Thanks
{{import pandas as pd}}
{{import pyarrow as pa}}
{{# note, Categorical column B is constructed from `pd.StringDtype`}}
{{df = pd.DataFrame({"A": ["a", "b", "c", "a"]}, dtype=pd.StringDtype())}}
{{df["B"] = df["A"].astype("category")}}
{{print(df["B"].cat.categories)}}
{{# Index(['a', 'b', 'c'], dtype='string')}}
{{# however, this is downcast to `object` during a roundtrip}}
{{print(pa.Table.from_pandas(df).to_pandas()["B"].cat.categories)}}
{{# Index(['a', 'b', 'c'], dtype='object')}}
> `dtype` of `Categorical` category columns are not preserved
> -----------------------------------------------------------
>
> Key: ARROW-17852
> URL: https://issues.apache.org/jira/browse/ARROW-17852
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Affects Versions: 9.0.0
> Reporter: Ryan Ballard
> Priority: Major
> Labels: categorical, pandas, pyarrow