Posted to issues@arrow.apache.org by "Joris Van den Bossche (Jira)" <ji...@apache.org> on 2020/02/04 11:07:00 UTC

[jira] [Commented] (ARROW-7747) [Python] coerce_timestamps + allow_truncated_timestamps does not work as expected with nanoseconds

    [ https://issues.apache.org/jira/browse/ARROW-7747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17029759#comment-17029759 ] 

Joris Van den Bossche commented on ARROW-7747:
----------------------------------------------

[~theophile] The error is thrown by the {{pa.Table.from_pandas(df, schema=pyarrow_schema)}} call (so not by the Parquet writing step, where you specify {{coerce_timestamps="ms", allow_truncated_timestamps=True}}).

For that {{from_pandas}} call the error is expected, I think, although we currently don't have an option to allow it (i.e. to indicate that you are OK with losing data).
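
A possible workaround to still end up with a millisecond-precision table is to cast explicitly after converting (a sketch, not verified against 0.15.1; it assumes {{Table.cast}} with {{safe=False}} permits the truncating cast):

{code}
>>> table = pa.Table.from_pandas(df)
>>> # safe=False opts in to the lossy ns -> ms truncation
>>> table = table.cast(pa.schema([pa.field("datetime_ms", pa.timestamp("ms"))]), safe=False)
>>> table.schema
datetime_ms: timestamp[ms]
{code}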

If you leave out the expected schema (and thus convert to an Arrow table with nanosecond-precision timestamps), the subsequent step of writing with millisecond resolution to Parquet works as expected:

{code}
>>> table = pa.Table.from_pandas(df)
>>> table
pyarrow.Table
datetime_ms: timestamp[ns]
metadata
--------
{b'pandas': ...

>>> pq.write_table(
...     table,
...     "test.parquet",
...     coerce_timestamps="ms",
...     allow_truncated_timestamps=True,
... )
>>> pq.read_table("test.parquet").to_pandas()
              datetime_ms
0 2019-06-21 22:13:02.901
{code}
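
For completeness, reading back the schema confirms that the file was written with millisecond precision (output abbreviated):

{code}
>>> pq.read_table("test.parquet").schema
datetime_ms: timestamp[ms]
...
{code}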

> [Python] coerce_timestamps + allow_truncated_timestamps does not work as expected with nanoseconds 
> ---------------------------------------------------------------------------------------------------
>
>                 Key: ARROW-7747
>                 URL: https://issues.apache.org/jira/browse/ARROW-7747
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 0.15.1
>            Reporter: Théophile Chevalier
>            Priority: Major
>
> Hi,
> I've encountered what seems to me to be a bug using
> {noformat}
> pyarrow==0.15.1
> pandas==0.25.3
> numpy==1.18.1{noformat}
>  
> I'm trying to write a table containing nanosecond timestamps to a millisecond schema. Here is a minimal example:
> {code:python}
> import pyarrow as pa
> import pyarrow.parquet as pq
> import pandas as pd
> import numpy as np
> pyarrow_schema = pa.schema([pa.field("datetime_ms", pa.timestamp("ms"))])
> timestamp = np.datetime64("2019-06-21T22:13:02.901123")
> d = {"datetime_ms": timestamp}
> df = pd.DataFrame(d, index=range(1))
> table = pa.Table.from_pandas(df, schema=pyarrow_schema)
> pq.write_table(
>     table,
>     "test.parquet",
>     coerce_timestamps="ms",
>     allow_truncated_timestamps=True,
> )
> {code}
> {noformat}
> pyarrow.lib.ArrowInvalid: ('Casting from timestamp[ns] to timestamp[ms] would lose data: 1561155182901123000', 'Conversion failed for column datetime_ms with type datetime64[ns]'){noformat}
> From my understanding, the expected behaviour should be Arrow allowing the conversion anyway, even if it loses some data.
> Related discussions:
> - https://github.com/apache/arrow/issues/1920
> - https://issues.apache.org/jira/browse/ARROW-2555
> This test https://github.com/apache/arrow/blob/f70dbd1dbdb51a47e6a8a8aac8efd40ccf4d44f2/python/pyarrow/tests/test_parquet.py#L846 does not explicitly check for nanosecond timestamps.
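> A sketch of what an explicit nanosecond check might look like (hypothetical test name, illustration only, using just the public API):
> {code:python}
> import numpy as np
> import pandas as pd
> import pyarrow as pa
> import pyarrow.parquet as pq
> def test_coerce_timestamps_truncates_nanoseconds(tmpdir):
>     # nanosecond-precision value that cannot be represented exactly in ms
>     df = pd.DataFrame({"ts": [np.datetime64("2019-06-21T22:13:02.901123456")]})
>     table = pa.Table.from_pandas(df)
>     path = str(tmpdir / "test.parquet")
>     pq.write_table(table, path, coerce_timestamps="ms", allow_truncated_timestamps=True)
>     result = pq.read_table(path).to_pandas()
>     # truncation (not rounding) to millisecond precision is expected
>     assert result["ts"][0] == pd.Timestamp("2019-06-21 22:13:02.901")
> {code}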
> To be honest I've not looked at the code yet, so let me know whether I missed something. I'd be happy to fix it if it's really a bug.


