Posted to dev@arrow.apache.org by "Søren Fuglede Jørgensen (Jira)" <ji...@apache.org> on 2020/03/02 08:03:00 UTC

[jira] [Created] (ARROW-7980) Deserialization with pyarrow fails for certain Timestamp-based data frame

Søren Fuglede Jørgensen created ARROW-7980:
----------------------------------------------

             Summary: Deserialization with pyarrow fails for certain Timestamp-based data frame
                 Key: ARROW-7980
                 URL: https://issues.apache.org/jira/browse/ARROW-7980
             Project: Apache Arrow
          Issue Type: Bug
          Components: Python
    Affects Versions: 0.16.0
            Reporter: Søren Fuglede Jørgensen


When following the [procedure outlined here](https://stackoverflow.com/a/57986261/5085211) to serialize and deserialize pandas data frames with `pyarrow`, the example below fails with the traceback that follows:


```python
import pandas as pd
import pyarrow as pa

df = pd.DataFrame([{'Minutes5UTC': '2020-02-25T21:15:00+00:00', 'Minutes5DK': '2020-02-25T22:15:00'}])
# Minutes5DK becomes timezone-naive, Minutes5UTC becomes timezone-aware (UTC)
df['Minutes5DK'] = pd.to_datetime(df.Minutes5DK)
df['Minutes5UTC'] = pd.to_datetime(df.Minutes5UTC)
context = pa.default_serialization_context()  # not used by the call below
# Round trip through bytes; the deserialize step raises the TypeError
pa.deserialize(pa.serialize(df).to_buffer().to_pybytes())
```

```
--------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-9-6f75cc47c6d5> in <module>
----> 1 pa.deserialize(pa.serialize(df).to_buffer().to_pybytes())

~/miniconda3/envs/emission/lib/python3.8/site-packages/pyarrow/serialization.pxi in pyarrow.lib.deserialize()

~/miniconda3/envs/emission/lib/python3.8/site-packages/pyarrow/serialization.pxi in pyarrow.lib.deserialize_from()

~/miniconda3/envs/emission/lib/python3.8/site-packages/pyarrow/serialization.pxi in pyarrow.lib.SerializedPyObject.deserialize()

~/miniconda3/envs/emission/lib/python3.8/site-packages/pyarrow/serialization.pxi in pyarrow.lib.SerializationContext._deserialize_callback()

~/miniconda3/envs/emission/lib/python3.8/site-packages/pyarrow/serialization.py in _deserialize_pandas_dataframe(data)
    167 
    168     def _deserialize_pandas_dataframe(data):
--> 169         return pdcompat.serialized_dict_to_dataframe(data)
    170 
    171     def _serialize_pandas_series(obj):

~/miniconda3/envs/emission/lib/python3.8/site-packages/pyarrow/pandas_compat.py in serialized_dict_to_dataframe(data)
    661 def serialized_dict_to_dataframe(data):
    662     import pandas.core.internals as _int
--> 663     reconstructed_blocks = [_reconstruct_block(block)
    664                             for block in data['blocks']]
    665 

~/miniconda3/envs/emission/lib/python3.8/site-packages/pyarrow/pandas_compat.py in <listcomp>(.0)
    661 def serialized_dict_to_dataframe(data):
    662     import pandas.core.internals as _int
--> 663     reconstructed_blocks = [_reconstruct_block(block)
    664                             for block in data['blocks']]
    665 

~/miniconda3/envs/emission/lib/python3.8/site-packages/pyarrow/pandas_compat.py in _reconstruct_block(item, columns, extension_columns)
    707                                 klass=_int.CategoricalBlock)
    708     elif 'timezone' in item:
--> 709         dtype = make_datetimetz(item['timezone'])
    710         block = _int.make_block(block_arr, placement=placement,
    711                                 klass=_int.DatetimeTZBlock,

~/miniconda3/envs/emission/lib/python3.8/site-packages/pyarrow/pandas_compat.py in make_datetimetz(tz)
    734 def make_datetimetz(tz):
    735     tz = pa.lib.string_to_tzinfo(tz)
--> 736     return _pandas_api.datetimetz_type('ns', tz=tz)
    737 
    738 

TypeError: 'NoneType' object is not callable
```


Interestingly, if I comment out the two `pd.to_datetime` lines, the round trip works, which is unsurprising since both columns are then plain strings. But if I then add the lines back and run the original example in the same session, it suddenly works as well. That is, the following runs without error:

```python
import pandas as pd
import pyarrow as pa

# First round trip: both columns are left as plain strings
df = pd.DataFrame([{'Minutes5UTC': '2020-02-25T21:15:00+00:00', 'Minutes5DK': '2020-02-25T22:15:00'}])
context = pa.default_serialization_context()
pa.deserialize(pa.serialize(df).to_buffer().to_pybytes())

# Second round trip: the original failing example, now run after the first one
df = pd.DataFrame([{'Minutes5UTC': '2020-02-25T21:15:00+00:00', 'Minutes5DK': '2020-02-25T22:15:00'}])
df['Minutes5DK'] = pd.to_datetime(df.Minutes5DK)
df['Minutes5UTC'] = pd.to_datetime(df.Minutes5UTC)
context = pa.default_serialization_context()
pa.deserialize(pa.serialize(df).to_buffer().to_pybytes())
```
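
One possible reading of this order dependence, which is a guess and has not been checked against the pyarrow source: the traceback ends in `'NoneType' object is not callable` on `_pandas_api.datetimetz_type`, which would fit an attribute that is filled in lazily and is still `None` the first time the timezone branch runs. The pure-Python sketch below only illustrates that general failure mode. `LazyApi`, `_ensure_loaded`, `make_tz_dtype` and `is_plain_frame` are hypothetical names made up for the illustration, not pyarrow's API.

```python
class LazyApi:
    """Hypothetical stand-in for a lazily initialized compatibility shim."""

    def __init__(self):
        self._loaded = False
        # Assumption for the sketch: the attribute starts out as None and is
        # only populated by _ensure_loaded().
        self.make_tz_dtype = None

    def _ensure_loaded(self):
        if not self._loaded:
            # Stands in for an expensive import performed on first use.
            self.make_tz_dtype = lambda unit, tz: f"datetime64[{unit}, {tz}]"
            self._loaded = True

    def is_plain_frame(self, obj):
        # This entry point triggers initialization as a side effect.
        self._ensure_loaded()
        return isinstance(obj, dict)


api = LazyApi()

# Calling the attribute before anything has triggered initialization fails.
try:
    api.make_tz_dtype("ns", tz="UTC")
    first_error = None
except TypeError as exc:
    first_error = str(exc)

# An unrelated call initializes the shim; the same call now succeeds.
api.is_plain_frame({})
dtype_after_init = api.make_tz_dtype("ns", tz="UTC")
```

If something like this is going on, the first (string-only) round trip would play the role of `is_plain_frame` here: it touches a code path that performs the initialization, so the later timezone-aware round trip finds the attribute populated.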

This happens with pyarrow 0.16.0, with both pandas 0.25.3 and pandas 1.0.1.


