Posted to issues@arrow.apache.org by "Joris Van den Bossche (JIRA)" <ji...@apache.org> on 2019/06/07 11:52:00 UTC

[jira] [Commented] (ARROW-2298) [Python] Add option to not consider NaN to be null when converting to an integer Arrow type

    [ https://issues.apache.org/jira/browse/ARROW-2298?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16858547#comment-16858547 ] 

Joris Van den Bossche commented on ARROW-2298:
----------------------------------------------

[~farnoy] For me, the example you show above works:
{code}
In [33]: schema = pa.schema([pa.field(name='a', type=pa.int64(), nullable=True)])

In [34]: pa.Table.from_pandas(df, schema=schema, preserve_index=False)                                                                                                                                         
Out[34]: 
pyarrow.Table
a: int64
metadata
--------
{b'pandas': b'{"index_columns": [], "column_indexes": [], "columns": [{"name":'
            b' "a", "field_name": "a", "pandas_type": "int64", "numpy_type": "'
            b'float64", "metadata": null}], "creator": {"library": "pyarrow", '
            b'"version": "0.13.1.dev313+g997226a9"}, "pandas_version": "0.24.2'
            b'"}'}

In [35]: table = _                                                                                                                                                                                                  

In [36]: table.column('a')                                                                                                                                                                                          
Out[36]: 
<Column name='a' type=DataType(int64)>
[
  [
    null,
    1,
    2,
    3,
    null
  ]
]
{code}

This is because {{Table.from_pandas}} assumes the data is coming from pandas, and therefore allows the above.
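For completeness, the {{df}} in the example above was not shown; it can be reconstructed roughly like this (a minimal sketch, with the values inferred from the output of In [36]):

{code:python}
import numpy as np
import pandas as pd
import pyarrow as pa

# pandas has no missing-value support for int64, so missing integers
# end up as NaN in a float64 column.
df = pd.DataFrame({'a': [np.nan, 1, 2, 3, np.nan]})

schema = pa.schema([pa.field(name='a', type=pa.int64(), nullable=True)])
table = pa.Table.from_pandas(df, schema=schema, preserve_index=False)
{code}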

Using just the array API, you can see the difference when converting a float numpy array to an integer arrow array:

{code:python}
In [41]: pa.array(np.array([1, 2, np.nan], dtype=float), type=pa.int64())                                                                                                                                           
...
ArrowInvalid: Floating point value truncated

In [42]: pa.array(np.array([1, 2, np.nan], dtype=float), type=pa.int64(), from_pandas=True)                                                                                                                         
Out[42]: 
<pyarrow.lib.Int64Array object at 0x7feaeea36548>
[
  1,
  2,
  null
]
{code}

Does that satisfy your use case? 

It might not help for very big integers that cannot be represented exactly as floats (those will still raise an error about values being truncated), but I think if you are coming from pandas, that use case will not be very frequent, exactly because pandas cannot properly represent that itself.
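To illustrate that caveat (a small sketch, independent of pyarrow): float64 has only 53 bits of mantissa, so integers above 2**53 cannot all be represented exactly, and a detour through a float column silently loses precision:

{code:python}
import numpy as np

big = 2**53 + 1  # 9007199254740993

# Storing this integer in a float64 drops the last bit, which is why a
# float -> int64 conversion cannot be trusted for values this large.
as_float = np.float64(big)
print(int(as_float) == big)  # False
print(int(as_float))         # 9007199254740992
{code}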

> [Python] Add option to not consider NaN to be null when converting to an integer Arrow type
> -------------------------------------------------------------------------------------------
>
>                 Key: ARROW-2298
>                 URL: https://issues.apache.org/jira/browse/ARROW-2298
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: Python
>            Reporter: Wes McKinney
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 0.14.0
>
>          Time Spent: 2h 10m
>  Remaining Estimate: 0h
>
> Follow-on work to ARROW-2135


