Posted to issues@arrow.apache.org by "Thomas Buhrmann (JIRA)" <ji...@apache.org> on 2019/07/18 17:06:00 UTC
[jira] [Commented] (ARROW-5379) [Python] support pandas' nullable Integer type in from_pandas
[ https://issues.apache.org/jira/browse/ARROW-5379?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16888163#comment-16888163 ]
Thomas Buhrmann commented on ARROW-5379:
----------------------------------------
For the particular case of pd.Int64Dtype, the following may serve as a workaround for now, in case it's useful to anybody. In short: cast pandas Int64 columns to 'object' before converting to Arrow. When converting back to pandas, call to_pandas with _integer_object_nulls=True_ and cast the resulting object columns back to Int64. It seems to work correctly for the cases below, i.e. pandas integer columns with or without missing values, and different integer sizes:
{code:python}
import pandas as pd
import pyarrow as pa


def from_pandas(df):
    """Cast Int64 to object before 'serializing'"""
    df = df.copy()  # avoid modifying the caller's DataFrame
    for col in df:
        if isinstance(df[col].dtype, pd.Int64Dtype):
            df[col] = df[col].astype('object')
    return pa.Table.from_pandas(df)


def to_pandas(tbl):
    """After 'deserializing', recover the correct int type"""
    # integer_object_nulls=True returns integer columns containing nulls
    # as object columns instead of converting them to float64
    df = tbl.to_pandas(integer_object_nulls=True)
    for col in df:
        if (pa.types.is_integer(tbl.schema.field_by_name(col).type) and
                pd.api.types.is_object_dtype(df[col].dtype)):
            df[col] = df[col].astype('Int64')
    return df


# Test cases: uncomment one of the alternatives to try it
df = pd.Series([0, 1, None, 2, 822215679726100500], dtype='Int64', name='x').to_frame()
# df = pd.Series([0, 1, 3, 2, 822215679726100500], dtype='Int64', name='x').to_frame()
# df = pd.Series([0, 1, 3, 2, 15], dtype='Int64', name='x').to_frame()
# df = pd.Series([0, 1, 3, 2, 15], dtype='int16', name='x').to_frame()
df2 = to_pandas(from_pandas(df))
print(df2.dtypes)
{code}
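As a side note on why the round trip goes through 'object' rather than the default float64: a float64 has a 53-bit significand, so large 64-bit integers such as the 822215679726100500 used in the test series cannot all be stored exactly. The sketch below illustrates this with plain Python floats (which are float64) and needs neither pandas nor pyarrow. The helper name survives_float64 is made up for this illustration and is not part of either library.

```python
def survives_float64(n):
    """Return True if the integer n is unchanged by a trip through float64.

    Hypothetical helper for illustration only; Python's float is a float64.
    """
    return int(float(n)) == n


big = 822215679726100500  # the large value from the test series
small = 15                # a small value from the other test series

print(big > 2**53)              # True: beyond the range where every integer is exact
print(survives_float64(small))  # True
print(survives_float64(big))    # False: the value is altered by the conversion
```

This is why converting a nullable integer column through float64 can silently change large values, whereas the object-based round trip keeps them as exact Python integers.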
> [Python] support pandas' nullable Integer type in from_pandas
> -------------------------------------------------------------
>
> Key: ARROW-5379
> URL: https://issues.apache.org/jira/browse/ARROW-5379
> Project: Apache Arrow
> Issue Type: Improvement
> Components: Python
> Reporter: Joris Van den Bossche
> Priority: Major
>
> From https://github.com/apache/arrow/issues/4168. We should add support for pandas' nullable Integer extension dtypes, as those could map nicely to Arrow's integer types.
> Ideally, though, this would happen in a generic way that is not specific to this extension type; that is discussed in ARROW-5271.