Posted to jira@arrow.apache.org by "Joris Van den Bossche (Jira)" <ji...@apache.org> on 2021/09/15 15:35:00 UTC

[jira] [Comment Edited] (ARROW-14004) to_pandas() converts to float instead of using pandas nullable types

    [ https://issues.apache.org/jira/browse/ARROW-14004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17415597#comment-17415597 ] 

Joris Van den Bossche edited comment on ARROW-14004 at 9/15/21, 3:34 PM:
-------------------------------------------------------------------------

bq. If the column was created with pandas first it is correctly preserved (I guess it's using stored metadata for this).

That's correct.

bq. As currently there is support for nullable types in pandas, just as in Arrow, it would be great to use these types when dealing with columns with null values.

Since pandas does not yet use those nullable dtypes by default (and quite a few parts of pandas still don't fully support them), I think pyarrow should not use them by default yet either.

bq. If you are reticent to change this behavior, a param would be nice too (e.g. `to_pandas(use_nullable_types: True)`).

There is actually already a keyword to customize the dtype used in the conversion to pandas, to support extension dtypes in general: {{types_mapper}}.

And this can be used to get the effect you want (only for int64):

{code:python}
table.to_pandas(types_mapper={pa.int64(): pd.Int64Dtype()}.get)
{code}

Now, if you want this for all integer dtypes (signed/unsigned and all bit widths), this of course gets a bit unwieldy, but you can define the dictionary once somewhere in your code and then reuse it.

For such a case, a {{use_nullable_dtypes}} keyword would be a nice shortcut (e.g. pandas' {{read_parquet}} already has this, and uses {{types_mapper}} under the hood). However, I am a bit hesitant to add such a keyword, as people might expect different behaviour from it. For example, pandas also has nullable float and string dtypes, and depending on your use case you might want nullable ints but not nullable floats (for floats there is less benefit in using them), whereas a general keyword should probably enable all of them.


> to_pandas() converts to float instead of using pandas nullable types
> --------------------------------------------------------------------
>
>                 Key: ARROW-14004
>                 URL: https://issues.apache.org/jira/browse/ARROW-14004
>             Project: Apache Arrow
>          Issue Type: Bug
>            Reporter: Miguel Cantón Cortés
>            Priority: Major
>         Attachments: image.png
>
>
> We've noticed that when converting an Arrow Table to pandas using `.to_pandas()` integer columns with null values get converted to float instead of using pandas nullable types.
> If the column was created with pandas first it is correctly preserved (I guess it's using stored metadata for this).
> I've attached a screenshot showing this behavior.
> As currently there is support for nullable types in pandas, just as in Arrow, it would be great to use these types when dealing with columns with null values.
> If you are reticent to change this behavior, a param would be nice too (e.g. `to_pandas(use_nullable_types: True)`).
>  


