Posted to dev@arrow.apache.org by "paul hess (Jira)" <ji...@apache.org> on 2020/03/12 19:55:00 UTC

[jira] [Created] (ARROW-8100) timestamp[ms] and date64 data types not working as expected on write

paul hess created ARROW-8100:
--------------------------------

             Summary: timestamp[ms] and date64 data types not working as expected on write
                 Key: ARROW-8100
                 URL: https://issues.apache.org/jira/browse/ARROW-8100
             Project: Apache Arrow
          Issue Type: Bug
          Components: Python
    Affects Versions: 0.15.1
            Reporter: paul hess


I expect that either timestamp[ms] or date64 will give me a millisecond-precision datetime/timestamp when written to a Parquet file. Instead, this is the behavior I see:

 
>>> arr = pa.array([datetime(2020, 12, 20)])
>>> arr.cast(pa.timestamp('ms'), safe=False)
<pyarrow.lib.TimestampArray object at 0x117f3d4c8>
[
  2020-12-20 00:00:00.000
]
>>> table = pa.Table.from_arrays([arr], names=["start_date"])
>>> table
pyarrow.Table
start_date: timestamp[us]
# just to make sure
>>> table.column("start_date").cast(pa.timestamp('ms'), safe=False)
<pyarrow.lib.ChunkedArray object at 0x117f5e9a8>
[
  [
    2020-12-20 00:00:00.000
  ]
]
# just to make extra sure
>>> schema = pa.schema([pa.field("start_date", pa.timestamp("ms"))])
>>> table.cast(schema, safe=False)
>>> parquet.write_table(table, "sldkfjasldkfj.parquet", coerce_timestamps="ms", compression="SNAPPY", allow_truncated_timestamps=True)

Result for the written file:

Schema:
{
  "type" : "record",
  "name" : "schema",
  "fields" : [ {
    "name" : "start_date",
    "type" : [ "null", {
      "type" : "long",
      "logicalType" : "timestamp-millis"
    } ],
    "default" : null
  } ]
}

Data:
||start_date||
|1608422400000|

 

That is a microsecond [us] value, despite casting to [ms] and setting the appropriate config on the write_table method. If it were a millisecond timestamp, it would translate back accurately to a datetime with fromtimestamp, but:
>>> from datetime import datetime
>>> datetime.fromtimestamp(1608422400000)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: year 52938 is out of range
>>> datetime.fromtimestamp(1608422400000 / 1000)
datetime.datetime(2020, 12, 19, 16, 0)
 

 

OK, so then we should use the date64() type; after all, the docs say "Create instance of 64-bit date (milliseconds since UNIX epoch 1970-01-01)":

 
>>> arr = pa.array([datetime(2020, 12, 20, 0, 0, 0, 123)], type=pa.date64())
>>> arr
<pyarrow.lib.Date64Array object at 0x11da877c8>
[
  2020-12-20
]
>>> table = pa.Table.from_arrays([arr], names=["start_date"])
>>> table
pyarrow.Table
start_date: date64[ms]
>>> parquet.write_table(table, "/Users/hessp/ddt/rest-ingress/bebedabeep.parquet", coerce_timestamps="ms", compression="SNAPPY", allow_truncated_timestamps=True)
 

Result for the written file:

Schema:
{
  "type" : "record",
  "name" : "schema",
  "fields" : [ {
    "name" : "start_date",
    "type" : [ "null", {
      "type" : "int",
      "logicalType" : "date"
    } ],
    "default" : null
  } ]
}

Data:
||start_date||
|18616|

 
That is "days since UNIX epoch 1970-01-01", just like the date32() type: the time info is stripped off. We can confirm this:
>>> arr.to_pylist()
[datetime.date(2020, 12, 20)]
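
The arithmetic behind the two stored values can be sketched with the standard library alone. The constant and variable names below are illustrative and do not come from pyarrow. The last lines also show that the seventh positional argument of datetime() is microseconds, so the 123 in the session above is 123 microseconds rather than 123 milliseconds.

```python
from datetime import date, datetime, timedelta

MS_PER_DAY = 24 * 60 * 60 * 1000  # milliseconds in one day

days = 18616  # the number found in the written file
as_date = date(1970, 1, 1) + timedelta(days=days)
print(as_date)  # 2020-12-20

# The same day expressed as milliseconds since the epoch.
as_ms = days * MS_PER_DAY
print(as_ms)  # 1608422400000

# datetime(year, month, day, hour, minute, second, microsecond)
dt = datetime(2020, 12, 20, 0, 0, 0, 123)
print(dt.microsecond)  # 123
```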
 

How do I write a millisecond precision timestamp with pyarrow.parquet?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

RE: [jira] [Created] (ARROW-8100) timestamp[ms] and date64 data types not working as expected on write

Posted by "Lee, David" <Da...@blackrock.com>.
I've never used cast(). I've converted Python datetimes to pa.timestamp values using:

pyarrow.array(obj, type=None, mask=None, size=None, from_pandas=None, bool safe=True, MemoryPool memory_pool=None)

where type is pa.timestamp("ms")

