Posted to jira@arrow.apache.org by "James Bourbeau (Jira)" <ji...@apache.org> on 2022/09/09 16:20:00 UTC

[jira] [Assigned] (ARROW-16838) Schema inference for pandas extension dtypes fails on indexes

     [ https://issues.apache.org/jira/browse/ARROW-16838?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

James Bourbeau reassigned ARROW-16838:
--------------------------------------

    Assignee: James Bourbeau

> Schema inference for pandas extension dtypes fails on indexes
> -------------------------------------------------------------
>
>                 Key: ARROW-16838
>                 URL: https://issues.apache.org/jira/browse/ARROW-16838
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 8.0.0
>            Reporter: Ian Rose
>            Assignee: James Bourbeau
>            Priority: Minor
>              Labels: pull-request-available
>          Time Spent: 1h
>  Remaining Estimate: 0h
>
> Hi! Calling pa.Schema.from_pandas on a DataFrame whose index has a pandas extension dtype (e.g., string[python]) raises an AttributeError:
> {code:python}
> import pandas as pd
> import pyarrow as pa
> df = pd.DataFrame({"a": [1, 2]}, index=pd.Index(["A", "B"], dtype="string"))
> pa.Schema.from_pandas(df)
> {code}
> produces
> {code:python}
> AttributeError                            Traceback (most recent call last)
> /tmp/ipykernel_1827952/3691394220.py in <module>
>       1 import pyarrow as pa
>       2 df = pd.DataFrame({"a": [1, 2]}, index=pd.Index(["A", "B"], dtype="string"))
> ----> 3 pa.Schema.from_pandas(df)
> ~/miniconda3/envs/dask/lib/python3.8/site-packages/pyarrow/types.pxi in pyarrow.lib.Schema.from_pandas()
> ~/miniconda3/envs/dask/lib/python3.8/site-packages/pyarrow/pandas_compat.py in dataframe_to_types(df, preserve_index, columns)
>     527             type_ = pa.array(c, from_pandas=True).type
>     528         elif _pandas_api.is_extension_array_dtype(values):
> --> 529             type_ = pa.array(c.head(0), from_pandas=True).type
>     530         else:
>     531             values, type_ = get_datetimetz_type(values, c.dtype, None)
> AttributeError: 'Index' object has no attribute 'head'
> {code}
> If I remove the `head` call, or manually convert the index to a Series before the type is inferred, schema inference succeeds.
> Reported downstream in https://github.com/dask/dask/issues/9186
> Related issue from a couple of years ago: https://issues.apache.org/jira/browse/ARROW-8159
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)