Posted to dev@arrow.apache.org by "Bryan Cutler (JIRA)" <ji...@apache.org> on 2017/10/17 18:23:00 UTC

[jira] [Created] (ARROW-1680) [Python] Timestamp unit change not done in from_pandas() conversion

Bryan Cutler created ARROW-1680:
-----------------------------------

             Summary: [Python] Timestamp unit change not done in from_pandas() conversion
                 Key: ARROW-1680
                 URL: https://issues.apache.org/jira/browse/ARROW-1680
             Project: Apache Arrow
          Issue Type: Bug
          Components: Python
            Reporter: Bryan Cutler


Calling {{Array.from_pandas}} with a pandas.Series of timestamps that have 'ns' unit, while specifying a type that coerces them to 'us', causes problems.  When the series has timestamps with a timezone, the requested unit is ignored and the resulting array keeps 'ns'.  When the series does not have a timezone, the unit is applied to the array type, but printing the array raises an OverflowError.

{noformat}
>>> import pandas as pd
>>> import pyarrow as pa
>>> from datetime import datetime
>>> s = pd.Series([datetime.now()])
>>> s_nyc = s.dt.tz_localize('tzlocal()').dt.tz_convert('America/New_York')
>>> arr = pa.Array.from_pandas(s_nyc, type=pa.timestamp('us', tz='America/New_York'))
>>> arr.type
TimestampType(timestamp[ns, tz=America/New_York])
>>> arr = pa.Array.from_pandas(s, type=pa.timestamp('us'))
>>> arr.type
TimestampType(timestamp[us])
>>> print(arr)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "pyarrow/array.pxi", line 295, in pyarrow.lib.Array.__repr__ (/home/bryan/git/arrow/python/build/temp.linux-x86_64-2.7/lib.cxx:26221)
    values = array_format(self, window=10)
  File "pyarrow/formatting.py", line 28, in array_format
    values.append(value_format(x, 0))
  File "pyarrow/formatting.py", line 49, in value_format
    return repr(x)
  File "pyarrow/scalar.pxi", line 63, in pyarrow.lib.ArrayValue.__repr__ (/home/bryan/git/arrow/python/build/temp.linux-x86_64-2.7/lib.cxx:19535)
    return repr(self.as_py())
  File "pyarrow/scalar.pxi", line 240, in pyarrow.lib.TimestampValue.as_py (/home/bryan/git/arrow/python/build/temp.linux-x86_64-2.7/lib.cxx:21600)
    return converter(value, tzinfo=tzinfo)
  File "pyarrow/scalar.pxi", line 204, in pyarrow.lib.lambda5 (/home/bryan/git/arrow/python/build/temp.linux-x86_64-2.7/lib.cxx:7295)
    TimeUnit_MICRO: lambda x, tzinfo: pd.Timestamp(
  File "pandas/_libs/tslib.pyx", line 402, in pandas._libs.tslib.Timestamp.__new__ (pandas/_libs/tslib.c:10051)
  File "pandas/_libs/tslib.pyx", line 1467, in pandas._libs.tslib.convert_to_tsobject (pandas/_libs/tslib.c:27665)
OverflowError: Python int too large to convert to C long
{noformat}
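
The OverflowError above would be consistent with the stored int64 values being relabeled as microseconds without being divided by 1000. That cause is an assumption on my part, not something confirmed in this report. A minimal standard-library sketch of the arithmetic, with hypothetical helper names and no pyarrow or pandas code involved:

```python
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)
INT64_MAX = 2**63 - 1


def to_epoch_nanos(ts):
    """Nanoseconds since the Unix epoch for a naive datetime (hypothetical helper)."""
    delta = ts - EPOCH
    micros = (delta.days * 86400 + delta.seconds) * 1000000 + delta.microseconds
    return micros * 1000


def nanos_to_micros(nanos):
    """A real ns -> us unit change divides the stored integer by 1000."""
    return nanos // 1000


# The timestamp value printed in the workaround output of this report.
ts = datetime(2017, 10, 17, 11, 4, 44, 308233)
nanos = to_epoch_nanos(ts)

# Correct conversion: the microsecond count maps back to the same instant.
micros = nanos_to_micros(nanos)
assert EPOCH + timedelta(microseconds=micros) == ts

# Assumed failure mode: the ns count is kept as-is but labeled 'us'.
# Scaling that value back up to nanoseconds no longer fits in an int64.
mislabeled_as_nanos = nanos * 1000
print(mislabeled_as_nanos > INT64_MAX)
```

If this reading is right, the fix belongs in the conversion step (dividing the values when the requested unit is coarser), not in the printing code.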

A workaround is to convert the values manually with astype before calling {{Array.from_pandas}}:
{noformat}
>>> arr = pa.Array.from_pandas(s.values.astype('datetime64[us]'))
>>> arr.type
TimestampType(timestamp[us])
>>> print(arr)
<pyarrow.lib.TimestampArray object at 0x7f6a67e0a3c0>
[
  Timestamp('2017-10-17 11:04:44.308233')
]
>>> 
{noformat}
