Posted to jira@arrow.apache.org by "Wes McKinney (Jira)" <ji...@apache.org> on 2020/06/24 22:50:00 UTC
[jira] [Commented] (ARROW-9211) [Python] ArrowInvalid error raised when deserialising pandas with pd.NaT values in object column

    [ https://issues.apache.org/jira/browse/ARROW-9211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17144465#comment-17144465 ] 

Wes McKinney commented on ARROW-9211:
-------------------------------------

Well, raising an exception is strictly better than returning garbage to you. The problem is in the conversion of NaT to null, which is not handled correctly:

{code}
In [29]: v                                                                                                        
Out[29]: 
                   bar
0  2020-06-22 06:54:56
1                  NaT

In [30]: pa.table(v)[0].chunk(0)                                                                                  
Out[30]: 
<pyarrow.lib.TimestampArray object at 0x7f4b4070e528>
[
  2020-06-22 06:54:56.000000,
  0001-01-01 00:00:00.000000
]
{code}
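
As a rough illustration of where the garbage values come from, the sketch below decodes the microsecond value quoted in the ArrowInvalid message and then scales it to nanoseconds by hand. It uses only the Python standard library; the helper wrap_int64 is a hypothetical name introduced here and is not part of pyarrow. The wrapped result lines up with the 1754-08-30 value reported for 0.16.0, which is consistent with (but does not prove) an unchecked int64 overflow in that version.

```python
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)
# Microsecond value quoted in the ArrowInvalid message from the issue.
NAT_AS_MICROS = -62135596800000000


def wrap_int64(x):
    """Reduce a Python int to the signed 64-bit range (two's complement).

    Hypothetical helper for illustration only; not a pyarrow function.
    """
    return (x + 2**63) % 2**64 - 2**63


# Interpreted as microseconds since the epoch, the value is 0001-01-01.
as_us = EPOCH + timedelta(microseconds=NAT_AS_MICROS)

# Scaling to nanoseconds leaves the int64 range, so a checked cast must fail.
exact_ns = NAT_AS_MICROS * 1000
fits = -2**63 <= exact_ns < 2**63

# An unchecked multiplication would wrap around instead.
wrapped_ns = wrap_int64(exact_ns)
# datetime only has microsecond resolution, so the last three digits are dropped.
as_wrapped = EPOCH + timedelta(microseconds=wrapped_ns // 1000)

print(as_us)       # 0001-01-01 00:00:00
print(fits)        # False
print(as_wrapped)  # 1754-08-30 22:43:41.128654
```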

pandas's NaT value is a bit weird, since its type subclasses datetime.datetime:

{code}
In [31]: type(pd.NaT).mro()                                                                                       
Out[31]: 
[pandas._libs.tslibs.nattype.NaTType,
 pandas._libs.tslibs.nattype._NaT,
 datetime.datetime,
 datetime.date,
 object]
{code}

I'm gonna quickly see if I can get pyarrow to recognize pandas.NaT, so I'm closing this as a duplicate of ARROW-842.
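
To make the failure mode concrete, here is a minimal sketch of why a sentinel that subclasses datetime.datetime can slip through a generic datetime check, and how testing for the sentinel first avoids that. Everything in it is an assumption for illustration: FakeNaTType, FAKE_NAT, to_micros_naive, and to_micros_checked are hypothetical names, the stand-in class is not pandas's real NaTType, and none of this is pyarrow's actual conversion code.

```python
import datetime

EPOCH = datetime.datetime(1970, 1, 1)
ONE_MICROSECOND = datetime.timedelta(microseconds=1)


class FakeNaTType(datetime.datetime):
    """Hypothetical stand-in for a NaT-like sentinel (not pandas's class)."""

    def __new__(cls):
        # Assumption: the sentinel carries placeholder fields of 0001-01-01.
        return super().__new__(cls, 1, 1, 1)


FAKE_NAT = FakeNaTType()


def to_micros_naive(value):
    """Treat anything that is a datetime as a real timestamp."""
    if isinstance(value, datetime.datetime):
        return (value - EPOCH) // ONE_MICROSECOND
    return None


def to_micros_checked(value):
    """Check for the sentinel before the generic datetime branch."""
    if value is FAKE_NAT:
        return None  # treated as null
    return to_micros_naive(value)


print(to_micros_naive(FAKE_NAT))    # -62135596800000000
print(to_micros_checked(FAKE_NAT))  # None
print(to_micros_checked(datetime.datetime(2020, 6, 22, 6, 54, 56)))  # 1592808896000000
```

With real pandas objects the sentinel test would presumably be an identity check against pd.NaT or a call to pd.isna, but that is not shown here.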

> [Python] ArrowInvalid error raised when deserialising pandas with pd.NaT values in object column
> ------------------------------------------------------------------------------------------------
>
>                 Key: ARROW-9211
>                 URL: https://issues.apache.org/jira/browse/ARROW-9211
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 0.17.0, 0.17.1
>            Reporter: Lawrence Ling
>            Assignee: Wes McKinney
>            Priority: Major
>             Fix For: 1.0.0
>
>
> In pyarrow 0.17.x when deserialising a pandas dataframe which has pd.NaT values in an object column, an ArrowInvalid error is raised:
> {code:java}
> pyarrow.lib.ArrowInvalid: Casting from timestamp[us] to timestamp[ns] would result in out of bounds timestamp: -62135596800000000
> {code}
> Reproducible code (using pyarrow==0.17.1 and pandas==1.0.3):
> {code:java}
> import pandas as pd
> import pyarrow.ipc as ipc
> import pyarrow as pa
> v = pd.DataFrame({
>     "bar": [1592808896000000000, pd.NaT]
> })
> # works fine as datetime64[ns] but not as object type
> v = v.astype({"bar": "datetime64[ns]"}).astype({"bar": "object"})
> bs = ipc.serialize_pandas(v).to_pybytes()
> df = ipc.deserialize_pandas(bs)  # error
> {code}
>  In pyarrow 0.16.0 no error occurs and df is returned as:
> {code:java}
>                             bar
> 0 2020-06-22 06:54:56.000000000
> 1 1754-08-30 22:43:41.128654848
> {code}
> Was the change in 0.17.x to raise an error an intentional behaviour change? The previous behaviour in 0.16.0 already seemed a bit like undefined behaviour, since it converted NaT to 1754-08-30 (which seems to be due to the -62135596800000000 timestamp mentioned in the error above).
> Also note that when serialized as datetime64[ns] rather than object, the code works fine in both 0.17.x and 0.16.0, returning:
> {code:java}
>                   bar
> 0 2020-06-22 06:54:56
> 1                 NaT
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)