Posted to dev@arrow.apache.org by Marnix van den Broek <ma...@bundlesandbatches.io> on 2022/04/01 09:26:18 UTC

PyArrow / Arrow questions about the time and date types

hi all,

I'm working on type conversions between different systems, and the details
of both the time and date data types raised some questions about their
behaviour and its potential impact on interoperability:

*Question 1*: For my own understanding: what purpose does the millisecond
date64 type serve?

*Question 2* relates to the definition and implementation of the date64
data type:

The definition of date64 from Schema.fbs [1] is:
*Milliseconds (64 bits) indicating UNIX time elapsed since the epoch (no
leap seconds), where the values are evenly divisible by 86400000*

However, in PyArrow I can create Date64 instances from integer input
values that are not evenly divisible by 86400000, and the original input
persists in the Arrow data. That seems very counterintuitive and a
potential cause of bugs in low-level transformations and when moving data
between systems with Arrow. Shouldn't (Py)Arrow either reject the input, or
convert it when explicitly asked to?

>>> pa.scalar(86499999, pa.date64())
<pyarrow.Date64Scalar: datetime.date(1970, 1, 2)>
>>> pa.scalar(86499999, pa.date64()).cast(pa.int64())
<pyarrow.Int64Scalar: 86499999>
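
For comparison, an explicit conversion could floor the value to the nearest
day boundary; sketched here with plain integer arithmetic (hypothetical
behaviour, not something PyArrow currently does):

>>> 86499999 - (86499999 % 86400000)  # floor to the day boundary
86400000

That result, 86400000, corresponds to the 1970-01-02 shown above.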


*Question 3*: both the time32 and time64 time-of-day types, in either
precision, accept and store integer input that falls outside the 24-hour
window. Like the issue raised about the date64 type, this seems like
unexpected behaviour, possibly even impacting interoperability; I expected
the boundaries of these values to be enforced. What is the desired
behaviour from the Arrow specification's perspective? Is it the current
behaviour, or should the input either be rejected or explicitly converted?

See:

>>> pa.scalar(-1,pa.time32('s')) # expected: exception or warning
<pyarrow.Time32Scalar: datetime.time(23, 59, 59)>
>>> pa.scalar(-1,pa.time32('s')).cast(pa.int32()) # expected: 86399
<pyarrow.Int32Scalar: -1>
>>> pa.scalar(86400,pa.time32('s')) # expected: exception or warning
<pyarrow.Time32Scalar: datetime.time(0, 0)>
>>> pa.scalar(86400,pa.time32('s')).cast(pa.int32()) # expected: 0
<pyarrow.Int32Scalar: 86400>
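
For reference, the wrap-around that the inline comments above expect is
plain modular arithmetic, shown here with Python's % operator (this is the
hypothetical conversion, not a PyArrow call):

>>> -1 % 86400    # the expected 86399, i.e. 23:59:59
86399
>>> 86400 % 86400 # the expected 0, i.e. 00:00
0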


I'm looking for answers to understand the intended behaviour. If questions 2
and 3 turn out to be implementation issues, let me know and I'll raise them
on GitHub (or Jira, if that's where they belong).

Thanks,
Marnix van den Broek

Data Engineer at bundlesandbatches.io

[1]
https://github.com/apache/arrow/blob/4591d76fce2846a29dac33bf01e9ba0337b118e9/format/Schema.fbs#L200-L201

Re: PyArrow / Arrow questions about the time and date types

Posted by Wes McKinney <we...@gmail.com>.
On Fri, Apr 1, 2022 at 2:00 PM Weston Pace <we...@gmail.com> wrote:
>
> > *Question 1*: For my own understanding: what purpose does the
> > millisecond date64 type serve?
>
> I don't actually know the answer to this one.

The rationale IIRC was that some systems represent dates this way, and
so the purpose was to provide a serialization-free path for such data.
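
For illustration, a minimal sketch of that path (assuming, as the scalar
examples elsewhere in this thread suggest, that an int64-to-date64 cast
reinterprets the same 64-bit values rather than converting them):

```
import pyarrow as pa

# Millisecond day boundaries, as a system that stores dates this way
# might already hold them in memory.
ms = pa.array([0, 86400000, 172800000], pa.int64())

# Reinterpret as date64: the storage is the same 64-bit integers, so no
# per-value serialization step is needed.
dates = ms.cast(pa.date64())
```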


Re: PyArrow / Arrow questions about the time and date types

Posted by Weston Pace <we...@gmail.com>.
> *Question 1*: For my own understanding: what purpose does the
> millisecond date64 type serve?

I don't actually know the answer to this one.

> *Question 2* relates to the definition and implementation of the
> date64 data type:
> ...
> Shouldn't (Py)Arrow either reject the input, or
> convert it when explicitly asked to?

Yes. There was a past discussion on this topic and a vote agreeing that
these values are invalid; see [1]. Feel free to file JIRAs where this
doesn't happen. The validation picture has improved somewhat with [2],
which should be part of the next release:

```
# When given an array we sometimes will not automatically
# validate if the validation requires inspecting the values
# which is expensive
>>> pa.array([86400],pa.time32('s'))
<pyarrow.lib.Time32Array object at 0x7f7c4a8eae80>
[
  <value out of range: 86400>
]

# There is a validate() method that can be called to do this
>>> pa.array([86400],pa.time32('s')).validate(full=True)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "pyarrow/array.pxi", line 1435, in pyarrow.lib.Array.validate
  File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: time32[s] 86400 is not within the acceptable
range of [0, 86400) s

# Even in the latest master it seems we do not apply this validation
# on scalar values.  Please do file a JIRA for this
>>> pa.scalar(86400,pa.time32('s'))
<pyarrow.Time32Scalar: datetime.time(0, 0)>
```

> *Question 3*: both the time32 and time64 time-of-day types, in either
> precision, accept and store integer input that falls outside the 24-hour
> window.
> ...
> What is the desired behaviour from the Arrow specification's perspective?
> Is it the current behaviour, or should the input either be rejected or
> explicitly converted?

Invalid values should be rejected, ideally at the boundaries. However, when
data is already in the correct memory layout, we need to allow for the
possibility of zero-copy construction, and so we may not implicitly validate.

For example, I would expect the following to always pass without error:

```
pa.array([86400], pa.int32()).cast(pa.time32('s'))
```

On the other hand this should always fail:

```
pa.array([86400], pa.int32()).cast(pa.time32('s')).validate(full=True)
```

Users should generally validate any data that they don't know for sure
is correct.
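
A minimal sketch of that pattern (the helper is illustrative, not a
pyarrow API; it relies only on the cast and validate calls shown above):

```
import pyarrow as pa

def ingest_time32_seconds(values):
    # The cast reuses the int32 buffer and does not range-check values.
    arr = pa.array(values, pa.int32()).cast(pa.time32('s'))
    # Opt in to the expensive value-level check; raises ArrowInvalid for
    # values outside [0, 86400).
    arr.validate(full=True)
    return arr
```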

[1] https://lists.apache.org/thread/0yks6lkv0p7kd3b46gcbc3cbr2y4kl95
[2] https://issues.apache.org/jira/browse/ARROW-10924
