Posted to jira@arrow.apache.org by "Daniel Figus (Jira)" <ji...@apache.org> on 2020/09/10 10:05:00 UTC
[jira] [Commented] (ARROW-8944) [Python] Pandas - Parquet - Pandas roundtrip causes out of bounds timestamp

    [ https://issues.apache.org/jira/browse/ARROW-8944?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17193511#comment-17193511 ] 

Daniel Figus commented on ARROW-8944:
-------------------------------------

[~jorisvandenbossche] I think this can be closed, as it was resolved with ARROW-842. I just double-checked, and my example from above now works.

> [Python] Pandas - Parquet - Pandas roundtrip causes out of bounds timestamp
> ---------------------------------------------------------------------------
>
>                 Key: ARROW-8944
>                 URL: https://issues.apache.org/jira/browse/ARROW-8944
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 0.17.0, 0.17.1
>         Environment: pandas==1.0.3
> pyarrow==0.17.1
> Python==3.7.6 @ Windows 10 64-bit
>            Reporter: Daniel Figus
>            Priority: Major
>
> The following pandas -> parquet -> pandas roundtrip raises an out of bounds timestamp error with pyarrow 0.17.0 and 0.17.1:
> {code:python}
> import pandas
> target = 'ts_roundtrip.parquet'
> dataframe = pandas.DataFrame({'id':[1,2,3],'timestamp':['', '', '']})
> dataframe['timestamp'] = pandas.to_datetime(dataframe['timestamp'],errors='raise')
> dataframe2 = pandas.DataFrame({'id':[4,5,6,7],'timestamp':['', '2020-03-02T03:03:17.791062Z','','']})
> dataframe2['timestamp'] = pandas.to_datetime(dataframe2['timestamp'],errors='raise')
> dataframe = dataframe.append(dataframe2)
> print(dataframe.head(10))
> dataframe.to_parquet(target, coerce_timestamps=None, index=False, version='2.0')
> dataframe_new = pandas.read_parquet(target)
> print(dataframe_new.head())
> {code}
> Output:
> {noformat}
>    id                         timestamp
> 0   1                               NaT
> 1   2                               NaT
> 2   3                               NaT
> 0   4                               NaT
> 1   5  2020-03-02 03:03:17.791062+00:00
> 2   6                               NaT
> 3   7                               NaT
> Traceback (most recent call last):
>   File "c:\some\path\pyarrow_ts_test.py", line 16, in <module>
>     dataframe_new = pandas.read_parquet(target)
>   File "c:\some\path\venv\lib\site-packages\pandas\io\parquet.py", line 310, in read_parquet
>     return impl.read(path, columns=columns, **kwargs)
>   File "c:\some\path\venv\lib\site-packages\pandas\io\parquet.py", line 125, in read
>     path, columns=columns, **kwargs
>   File "pyarrow\array.pxi", line 587, in pyarrow.lib._PandasConvertible.to_pandas
>   File "pyarrow\table.pxi", line 1640, in pyarrow.lib.Table._to_pandas
>   File "c:\some\path\venv\lib\site-packages\pyarrow\pandas_compat.py", line 766, in table_to_blockmanager
>     blocks = _table_to_blocks(options, table, categories, ext_columns_dtypes)
>   File "c:\some\path\venv\lib\site-packages\pyarrow\pandas_compat.py", line 1102, in _table_to_blocks
>     list(extension_columns.keys()))
>   File "pyarrow\table.pxi", line 1107, in pyarrow.lib.table_to_blocks
>   File "pyarrow\error.pxi", line 85, in pyarrow.lib.check_status
> pyarrow.lib.ArrowInvalid: Casting from timestamp[us] to timestamp[ns] would result in out of bounds timestamp: -62135596800000000
> {noformat}
> Background: 
>  We have a dataset with a sparsely populated timestamp column that originates from many JSON files. It is therefore very likely that some of those JSON files contain an empty string instead of a timestamp (an ISO-format string). Each JSON file was read into a pandas dataframe, the timestamp column was cast to datetime, and all dataframes were appended. That was done with pyarrow<0.17.0; those parquet files can no longer be read and produce the error message above as well.
> A closer look at our old parquet files shows that the NaTs are converted to "1754-08-30 22:43:41.128654848" when read back into a pandas dataframe :(. You get the same result when you run the above code with pyarrow==0.16.0.
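
The two numbers in the quoted report appear to be related by plain integer arithmetic. The sketch below uses only the Python standard library and assumes (this is not confirmed anywhere in the thread) that the older read path multiplied the stored microsecond value by 1000 without a bounds check, so the result wrapped around in a signed 64-bit integer. The helper name `wrap_int64` is made up for this illustration.

```python
import datetime

# Value taken verbatim from the ArrowInvalid message (microseconds since epoch).
US_VALUE = -62135596800000000

EPOCH = datetime.datetime(1970, 1, 1)
INT64_MIN = -2**63
INT64_MAX = 2**63 - 1

# Interpreted as microseconds, the value is the first representable datetime.
as_datetime = EPOCH + datetime.timedelta(microseconds=US_VALUE)
print(as_datetime)  # 0001-01-01 00:00:00

# Converting to nanoseconds does not fit into a signed 64-bit integer,
# which is what the bounds check in 0.17.x reports.
ns_unwrapped = US_VALUE * 1000
print(INT64_MIN <= ns_unwrapped <= INT64_MAX)  # False


def wrap_int64(value):
    """Hypothetical helper: emulate two's-complement int64 overflow."""
    return (value + 2**63) % 2**64 - 2**63


# Assumption: an unchecked multiplication would wrap around like this.
ns_wrapped = wrap_int64(ns_unwrapped)

# Split into whole seconds and leftover nanoseconds, because
# datetime.datetime only has microsecond resolution.
seconds, nanos = divmod(ns_wrapped, 10**9)
wrapped_dt = EPOCH + datetime.timedelta(seconds=seconds)
print(f"{wrapped_dt}.{nanos:09d}")  # 1754-08-30 22:43:41.128654848
```

The wrapped value matches the timestamp that the old files show in place of NaT, which is consistent with the NaT entries having been written as 0001-01-01 in microsecond resolution and overflowing on the way back to nanoseconds.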



--
This message was sent by Atlassian Jira
(v8.3.4#803005)