Posted to issues@arrow.apache.org by "Thomas Buhrmann (JIRA)" <ji...@apache.org> on 2019/07/18 17:06:00 UTC

[jira] [Commented] (ARROW-5379) [Python] support pandas' nullable Integer type in from_pandas

    [ https://issues.apache.org/jira/browse/ARROW-5379?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16888163#comment-16888163 ] 

Thomas Buhrmann commented on ARROW-5379:
----------------------------------------

For the particular case of pd.Int64Dtype, the following may serve as a workaround for now, in case it's useful to anybody. In short: cast pandas Int64 columns to 'object' before converting to Arrow. When converting back to pandas, call to_pandas with _integer_object_nulls=True_ and cast the affected columns back to Int64. It seems to work correctly for the cases below, i.e. pandas integer columns with or without missing values, and different integer sizes:

 
{code:python}
import pandas as pd
import pyarrow as pa


def from_pandas(df):
    """Cast Int64 columns to object before 'serializing' to Arrow."""
    df = df.copy()  # avoid modifying the caller's DataFrame in place
    for col in df:
        if isinstance(df[col].dtype, pd.Int64Dtype):
            df[col] = df[col].astype('object')
    return pa.Table.from_pandas(df)


def to_pandas(tbl):
    """After 'deserializing', recover the nullable integer dtype."""
    # Integer columns containing nulls come back as object rather than float64
    df = tbl.to_pandas(integer_object_nulls=True)

    for col in df:
        if (pa.types.is_integer(tbl.schema.field(col).type) and
                pd.api.types.is_object_dtype(df[col].dtype)):
            df[col] = df[col].astype('Int64')

    return df


df = pd.Series([0, 1, None, 2, 822215679726100500], dtype='Int64', name='x').to_frame()
# df = pd.Series([0, 1, 3, 2, 822215679726100500], dtype='Int64', name='x').to_frame()
# df = pd.Series([0, 1, 3, 2, 15], dtype='Int64', name='x').to_frame()
# df = pd.Series([0, 1, 3, 2, 15], dtype='int16', name='x').to_frame()

df2 = to_pandas(from_pandas(df))
print(df2.dtypes)
{code}
 

> [Python] support pandas' nullable Integer type in from_pandas
> -------------------------------------------------------------
>
>                 Key: ARROW-5379
>                 URL: https://issues.apache.org/jira/browse/ARROW-5379
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: Python
>            Reporter: Joris Van den Bossche
>            Priority: Major
>
> From https://github.com/apache/arrow/issues/4168. We should add support for pandas' nullable Integer extension dtypes, as those could map nicely to Arrow's integer types.
> Ideally this happens in a generic way though, and not specifically for this extension type, which is discussed in ARROW-5271.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)