Posted to dev@arrow.apache.org by Ji Liu <ti...@apache.org> on 2020/08/05 03:18:02 UTC

[DISCUSS] How to extended time value range for Timestamp type?

Hi all,

Currently the Arrow Timestamp type supports different TimeUnits (seconds,
milliseconds, microseconds, nanoseconds) with int64 storage. In
most cases this is enough, but if the timestamp value range of an external
system exceeds int64_t::max, then it is impossible to convert directly to
Arrow Timestamp. Consider the following use case:

A timestamp in another system, stored as int64 + int32 (milliseconds and
nanoseconds), can represent values from 0000-00-00 to 9999-12-31
23:59:59.999999999. If we want to convert a type like this, what should we do?
One option is probably to create an extension type with struct(int64, int32)
for storage.
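
To make the overflow concrete, here is a minimal sketch in Python. It assumes
the int64 holds milliseconds since the Unix epoch and the int32 holds the
nanoseconds within that millisecond; the message above does not spell out the
exact split, so this layout and the helper name split_timestamp are
illustrative assumptions only.

```python
from datetime import datetime, timezone

INT64_MAX = 2**63 - 1
INT32_MAX = 2**31 - 1
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def split_timestamp(instant, nanos_of_second):
    """Return (milliseconds since epoch, nanoseconds within that millisecond).

    Assumed layout, not taken from any real system. `instant` must be
    timezone-aware with a zero microsecond field; the sub-second part is
    passed separately because datetime cannot hold nanoseconds.
    """
    delta = instant - EPOCH
    whole_seconds = delta.days * 86_400 + delta.seconds
    millis = whole_seconds * 1_000 + nanos_of_second // 1_000_000
    sub_milli_nanos = nanos_of_second % 1_000_000
    return millis, sub_milli_nanos


# Upper bound from the use case: 9999-12-31 23:59:59.999999999 (UTC assumed).
upper = datetime(9999, 12, 31, 23, 59, 59, tzinfo=timezone.utc)
millis, sub_milli_nanos = split_timestamp(upper, 999_999_999)

# The same instant as a single nanosecond count, as an int64 nanosecond
# timestamp would need to store it.
total_nanos = millis * 1_000_000 + sub_milli_nanos
fits_in_int64_nanos = total_nanos <= INT64_MAX

print(millis, sub_milli_nanos, fits_in_int64_nanos)
```

Under these assumptions each component fits its own field, but the combined
nanosecond count does not fit in a single int64.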

Besides ExtensionType, are we considering extending our Timestamp to a wider
range, or maybe adding a new type for cases like the above?


Thanks,
Ji Liu

Re: [DISCUSS] How to extended time value range for Timestamp type?

Posted by Jacques Nadeau <ja...@apache.org>.
+1, let's be cautious adding these kinds of things.

On Wed, Aug 5, 2020 at 5:49 AM Wes McKinney <we...@gmail.com> wrote:

> I also am not sure there is a good case for a new built-in type since it
> introduces a good deal of complexity, particularly when there is the
> extension type option. We’ve been living with 64-bit nanoseconds in pandas
> for a decade, for example (and without the option for lower resolutions!!),
> and while it does arise as a limitation from time to time, the use cases
> are so specialized that it has never made sense to do anything about it.

Re: [DISCUSS] How to extended time value range for Timestamp type?

Posted by Wes McKinney <we...@gmail.com>.
I also am not sure there is a good case for a new built-in type since it
introduces a good deal of complexity, particularly when there is the
extension type option. We’ve been living with 64-bit nanoseconds in pandas
for a decade, for example (and without the option for lower resolutions!!),
and while it does arise as a limitation from time to time, the use cases
are so specialized that it has never made sense to do anything about it.

On Tue, Aug 4, 2020 at 11:26 PM Micah Kornfield <em...@gmail.com>
wrote:

> I think a stronger case needs to be made for adding a new builtin type to
> support this.  Can you provide concrete use-cases?  Why can't dates outside
> of the range representable by int64 be truncated (even at nanosecond
> precision, the 64-bit max value is over 200 years in the future)?  It seems
> that in most cases, nanosecond-level values outside the range representable
> in 64 bits are sentinel values.
>
> FWIW, Parquet had an int96 type that was used for timestamps but it has
> been deprecated [1] in favor of int64 nanos.
>
> -Micah
>
> [1] https://issues.apache.org/jira/browse/PARQUET-323

Re: [DISCUSS] How to extended time value range for Timestamp type?

Posted by Micah Kornfield <em...@gmail.com>.
I think a stronger case needs to be made for adding a new builtin type to
support this.  Can you provide concrete use-cases?  Why can't dates outside
of the range representable by int64 be truncated (even at nanosecond
precision, the 64-bit max value is over 200 years in the future)?  It seems
that in most cases, nanosecond-level values outside the range representable
in 64 bits are sentinel values.
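
As a rough check of the range mentioned above, the following Python sketch
computes the latest instant that fits in int64 nanoseconds since the Unix
epoch and shows one possible way to truncate out-of-range values by
saturating them. The helper name clamp_to_int64 is hypothetical, and
saturation is only one interpretation of "truncated".

```python
from datetime import datetime, timedelta, timezone

INT64_MAX = 2**63 - 1
INT64_MIN = -(2**63)
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

# Latest instant representable as int64 nanoseconds since the Unix epoch.
# datetime only has microsecond resolution, so the value is floored to
# whole microseconds before the addition.
max_nanos_instant = EPOCH + timedelta(microseconds=INT64_MAX // 1000)


def clamp_to_int64(total_nanos):
    """Saturate an arbitrary-size nanosecond count to the int64 range."""
    return max(INT64_MIN, min(INT64_MAX, total_nanos))


print(max_nanos_instant.isoformat())
print(clamp_to_int64(10**20) == INT64_MAX)
```

The upper limit lands in the year 2262, consistent with "over 200 years in
the future" as of 2020.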

FWIW, Parquet had an int96 type that was used for timestamps but it has
been deprecated [1] in favor of int64 nanos.

-Micah

[1] https://issues.apache.org/jira/browse/PARQUET-323

On Tue, Aug 4, 2020 at 8:52 PM Fan Liya <li...@gmail.com> wrote:

> Hi Ji,
>
> This sounds like a universal requirement, as 64 bits are not sufficient to
> hold nanosecond precision.
>
> For the extension type, we have two choices:
> 1. Extending struct(int64, int32), which represents the design of SoA
> (Struct of Arrays).
> 2. Extending fixed width binary(12), which represents the design of AoS
> (Array of Structs)
>
> Given the universal requirement, I'd prefer a new type.
>
> Best,
> Liya Fan

Re: [DISCUSS] How to extended time value range for Timestamp type?

Posted by Fan Liya <li...@gmail.com>.
Hi Ji,

This sounds like a universal requirement, as 64 bits are not sufficient to
hold nanosecond precision.

For the extension type, we have two choices:
1. Extending struct(int64, int32), which represents the design of SoA
(Struct of Arrays).
2. Extending fixed width binary(12), which represents the design of AoS
(Array of Structs)
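
The difference between the two storage layouts can be sketched with the
Python standard library alone. This does not use the Arrow API; the sample
values, the little-endian packing, and the helper read_aos are assumptions
made for illustration.

```python
import struct
from array import array

# Hypothetical sample values: (milliseconds since epoch, sub-millisecond nanos).
values = [(0, 0), (253402300799999, 999_999), (-1, 500)]

# Option 1, SoA: one contiguous buffer per field, similar in spirit to how a
# struct(int64, int32) keeps each child in its own array.
millis_column = array("q", (m for m, _ in values))
nanos_column = array("i", (n for _, n in values))

# Option 2, AoS: each value is one 12-byte record (8-byte int64 followed by a
# 4-byte int32, no padding), concatenated like fixed width binary(12).
RECORD = struct.Struct("<qi")
aos_buffer = b"".join(RECORD.pack(m, n) for m, n in values)


def read_aos(buffer, index):
    """Decode the record at `index` from the packed 12-byte layout."""
    return RECORD.unpack_from(buffer, index * RECORD.size)


print(RECORD.size, len(aos_buffer))
print((millis_column[1], nanos_column[1]) == read_aos(aos_buffer, 1))
```

With SoA, a scan over only the milliseconds field touches one buffer; with
AoS, both fields of a value sit next to each other in memory.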

Given the universal requirement, I'd prefer a new type.

Best,
Liya Fan


On Wed, Aug 5, 2020 at 11:18 AM Ji Liu <ti...@apache.org> wrote:

> Hi all,
>
> Currently the Arrow Timestamp type supports different TimeUnits (seconds,
> milliseconds, microseconds, nanoseconds) with int64 storage. In
> most cases this is enough, but if the timestamp value range of an external
> system exceeds int64_t::max, then it is impossible to convert directly to
> Arrow Timestamp. Consider the following use case:
>
> A timestamp in another system, stored as int64 + int32 (milliseconds and
> nanoseconds), can represent values from 0000-00-00 to 9999-12-31
> 23:59:59.999999999. If we want to convert a type like this, what should we do?
> One option is probably to create an extension type with struct(int64, int32)
> for storage.
>
> Besides ExtensionType, are we considering extending our Timestamp to a wider
> range, or maybe adding a new type for cases like the above?
>
>
> Thanks,
> Ji Liu
>