Posted to dev@arrow.apache.org by "Joris Van den Bossche (Jira)" <ji...@apache.org> on 2020/01/14 11:05:00 UTC

[jira] [Created] (ARROW-7569) [Python] Add API to map Arrow types to pandas ExtensionDtypes for to_pandas conversions

Joris Van den Bossche created ARROW-7569:
--------------------------------------------

             Summary: [Python] Add API to map Arrow types to pandas ExtensionDtypes for to_pandas conversions
                 Key: ARROW-7569
                 URL: https://issues.apache.org/jira/browse/ARROW-7569
             Project: Apache Arrow
          Issue Type: Improvement
          Components: Python
            Reporter: Joris Van den Bossche
             Fix For: 0.16.0


ARROW-2428 was about adding such a mapping, and described three use cases (see this [comment|https://issues.apache.org/jira/browse/ARROW-2428?focusedCommentId=16914231&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16914231] for details):

* Basic roundtrip based on the pandas_metadata (in {{to_pandas}}, we check whether the pandas_metadata specifies pandas extension dtypes, and if so, use those as the target dtypes for the corresponding columns)
* Conversion for pyarrow extension types that can define their equivalent pandas extension dtype
* A way to override the default conversion (e.g. for the built-in types, or in the absence of pandas_metadata in the schema). This would require the user to be able to specify a mapping from pyarrow type or column name to the pandas extension dtype to use.

The PR that closed ARROW-2428 (https://github.com/apache/arrow/pull/5512) covered only the first two cases, not the third.

I think it would still be worthwhile to cover the third case in some way.

An example use case is the new nullable dtypes being introduced in pandas (e.g. the nullable integer dtype). Assume I want to read a parquet file into a pandas DataFrame using this nullable integer dtype. The pyarrow Table has no pandas_metadata indicating that this dtype should be used (unless it was created from a pandas DataFrame that already used this dtype, but that will often not be the case), and the pyarrow.int64() type is not an extension type that can define its equivalent pandas extension dtype.
Currently, the only solution is to first read it into a pandas DataFrame (which will use floats for the integers if there are nulls), and then convert those floats back to a nullable integer dtype.

A possible API for this could look like:

{code}
table.to_pandas(types_mapping={pa.int64(): pd.Int64Dtype()})
{code}

to indicate that you want to convert all int64-typed columns of the pyarrow table to pandas columns using the nullable Int64 dtype.
 







