Posted to dev@arrow.apache.org by Julien Le Dem <ju...@dremio.com> on 2016/10/03 22:16:38 UTC

Re: Timestamps with different precision / Timedeltas

Consistency with Parquet is a plus.
Parquet supports timestamp millis and micros (no nanos):
https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#datetime-types

Currently, Arrow timestamps have a timezone field:
https://github.com/apache/arrow/blob/master/format/Message.fbs#L67
Wes: regarding your suggestion, do we want to change timestamp as follows?
- remove the "timezone" field and say it's UTC
- add unit field (MICROS | MILLIS)
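
As a rough illustration of the proposed change, the sketch below models a timestamp type that has no timezone field and carries only a unit. The names TimeUnit, TimestampType and to_datetime are hypothetical and are not taken from Message.fbs; values are assumed to be int64 tick counts since the UNIX epoch, in UTC.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from enum import Enum


class TimeUnit(Enum):
    """Hypothetical unit field; each value is the number of ticks per second."""
    MILLIS = 1_000
    MICROS = 1_000_000


@dataclass(frozen=True)
class TimestampType:
    """Hypothetical timestamp metadata: no timezone field, values are always UTC."""
    unit: TimeUnit


EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_datetime(value: int, ts_type: TimestampType) -> datetime:
    """Interpret an int64 tick count since the UNIX epoch as a UTC datetime."""
    ticks_per_second = ts_type.unit.value
    seconds, ticks = divmod(value, ticks_per_second)
    # Convert the sub-second remainder to microseconds for timedelta.
    micros = ticks * 1_000_000 // ticks_per_second
    return EPOCH + timedelta(seconds=seconds, microseconds=micros)


millis_type = TimestampType(TimeUnit.MILLIS)
micros_type = TimestampType(TimeUnit.MICROS)

# The same instant stored with two different units.
from_millis = to_datetime(1_475_532_998_000, millis_type)
from_micros = to_datetime(1_475_532_998_000_000, micros_type)
```

Both calls decode to the same UTC instant; only the metadata differs, not the physical int64 layout.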



On Fri, Sep 30, 2016 at 12:20 PM, Donald Foss <do...@gmail.com> wrote:

> +1 for nano or milli, or something else?
>
> TL;DR;
>
> epochMilli++
>
> —
>
> Wes, the hierarchy is eminently reasonable, so +1 from me for that.
> Regarding your aside, I am also a fan of the
> http://speleotrove.com/decimal/decarith.html specification, though I must
> admit I am biased simply because it addresses the Rexx Lost Digits
> condition.
>
> The most commonly used timestamps I see are stored as epoch milliseconds,
> or epochMillis.  It may not be canonical; however, many billions of devices
> and software applications use it.
>
> To support extremely fine grained DateTime representations, particularly
> in common scientific applications, I’m for _epochNano_, with logical
> casting to work with existing datasets that are in epochMilli instead.  We
> can deal with the rollover in 300k years.
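
To make the rollover arithmetic concrete, the sketch below computes how many years a signed int64 can span at each resolution and shows a logical cast from epochMilli to epochNano. The helper names are hypothetical, and a year is approximated as 365.25 days. Under these assumptions a horizon of roughly 300k years corresponds to microsecond resolution, while int64 nanoseconds cover roughly 292 years on either side of the epoch.

```python
INT64_MIN = -(2**63)
INT64_MAX = 2**63 - 1
SECONDS_PER_YEAR = 365.25 * 24 * 3600  # approximation, ignores leap seconds

TICKS_PER_SECOND = {
    "second": 1,
    "milli": 10**3,
    "micro": 10**6,
    "nano": 10**9,
}


def years_representable(unit: str) -> float:
    """Years a signed int64 can count on one side of the epoch at this unit."""
    return INT64_MAX / TICKS_PER_SECOND[unit] / SECONDS_PER_YEAR


def cast_milli_to_nano(epoch_milli: int) -> int:
    """Logical cast from epochMilli to epochNano, checking for int64 overflow."""
    epoch_nano = epoch_milli * 10**6
    if not INT64_MIN <= epoch_nano <= INT64_MAX:
        raise OverflowError("value does not fit in int64 nanoseconds")
    return epoch_nano


for unit in TICKS_PER_SECOND:
    years = years_representable(unit)
    print(f"{unit:>6}: about {years:,.0f} years either side of the epoch")
```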
>
> While I personally would prefer assigning 0 as 2000-01-01T00:00:00.00Z, I
> doubt it will ever happen. No, I’m not a millennial.
>
> My only concern is for use of 64-bit logical DateTime at the small Physics
> level.  For that use case, UT2 is more appropriate; measurements are
> frequently in fractions of nanoseconds.  Perhaps there could be a way to
> logically cast a signed int96, which is supported by Parquet.
>
> Timestamp [logical type]
> extends FixedDecimal [logical type] (int64)
> extends FixedWidth [physical type] byteArray[8]
>
> Timestamp96 [logical type]
> extends FixedDecimal [logical type] (int96)
> extends FixedWidth [physical type] byteArray[12]
>
> —
>
> Although inappurtenant to this specific discussion, I would like to see a
> standardized DateTime specification that uses a signed int64 as the decimal
> epochSecond and an unsigned int96 as the fractional representation of a
> second.
>
> TimestampHiggs [logical type]
> extends FixedDecimal [logical type] [(int64), (uint96)] :: join()ing of 2
> columns, the fixed decimal epochSecond and the fractional second as
> (n/2^96).
> extends FixedWidth [physical type] byteArray[8], byteArray[12]
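
The TimestampHiggs idea above can be sketched as a pair of integers: a signed int64 of whole epoch seconds and an unsigned 96-bit numerator n, with the fraction of a second being n/2^96. The class and method names are hypothetical, the little-endian byte order is an assumption, and Python's arbitrary-precision int and fractions.Fraction stand in for fixed-width fields.

```python
from dataclasses import dataclass
from fractions import Fraction

FRACTION_BITS = 96
FRACTION_DENOMINATOR = 1 << FRACTION_BITS


@dataclass(frozen=True)
class TimestampHiggs:
    """Hypothetical two-column timestamp: epoch seconds plus n / 2**96 of a second."""
    epoch_second: int  # must fit a signed int64
    fraction: int      # must fit an unsigned 96-bit integer

    def __post_init__(self) -> None:
        if not -(1 << 63) <= self.epoch_second < (1 << 63):
            raise ValueError("epoch_second does not fit in a signed int64")
        if not 0 <= self.fraction < FRACTION_DENOMINATOR:
            raise ValueError("fraction does not fit in an unsigned int96")

    def as_fraction(self) -> Fraction:
        """Exact number of seconds since the epoch as a rational number."""
        return self.epoch_second + Fraction(self.fraction, FRACTION_DENOMINATOR)

    def to_bytes(self) -> bytes:
        """byteArray[8] followed by byteArray[12]; byte order is an assumption."""
        return (self.epoch_second.to_bytes(8, "little", signed=True)
                + self.fraction.to_bytes(12, "little", signed=False))


# 1.5 seconds after the epoch: one whole second plus half of 2**96.
half_past = TimestampHiggs(epoch_second=1, fraction=1 << 95)
```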
>
> —Donald
>
> > On Sep 29, 2016, at 7:07 PM, Jacques Nadeau <ja...@apache.org> wrote:
> >
> > +1
> >
> > On Thu, Sep 29, 2016 at 3:19 PM, Wes McKinney <we...@gmail.com> wrote:
> >
> >> hello,
> >>
> >> For the current iteration of Arrow, can we agree to support int64 UNIX
> >> timestamps with a particular resolution (second through nanosecond),
> >> as these are reasonably common representations? We can look to expand
> >> later if it is needed.
> >>
> >> Thanks
> >> Wes
> >>
> >> On Mon, Aug 15, 2016 at 4:12 AM, Wes McKinney <we...@gmail.com> wrote:
> >>> Bumping this discussion. As part of finalizing a v1 Arrow spec (for
> >>> purposes of moving data between systems, at minimum) we should propose
> >>> timestamp metadata and physical memory representation that maximizes
> >>> interoperability with other systems. It seems like a fixed decimal
> >>> would meet this requirement as UNIX-like timestamps at some resolution
> >>> could pass unmodified with appropriate metadata.
> >>>
> >>> We will also need decimal types in Arrow (at least to accommodate
> >>> common database representations and file formats like Parquet), so
> >>> this seems like a reasonable potential hierarchy of types:
> >>>
> >>> Timestamp [logical type]
> >>> extends FixedDecimal [logical type]
> >>> extends FixedWidth [physical type]
> >>>
> >>> I did a bit of internet searching but did not find a canonical
> >>> reference or implementation of fixed decimals; that would be helpful.
> >>>
> >>> As an aside: for floating decimal numbers for numerical data we could
> >>> utilize an implementation like http://www.bytereef.org/mpdecimal/
> >>> which implements the spec described at
> >>> http://speleotrove.com/decimal/decarith.html
> >>>
> >>> Thanks
> >>> Wes
> >>>
> >>> On Thu, Jul 14, 2016 at 8:18 AM, Alex Samuel <al...@alexsamuel.net> wrote:
> >>>> Hi all,
> >>>>
> >>>> May I suggest that instead of fixed-point decimals, you consider a more
> >>>> general fixed-denominator rational representation, for times and other
> >>>> purposes? Powers of ten are convenient for humans, but powers of two are
> >>>> more efficient. For some applications, the efficiency of bit operations
> >>>> over divmod is more useful than an exact representation of integral
> >>>> nanoseconds.
> >>>>
> >>>> std::chrono takes this approach. I'll also humbly point you at my own
> >>>> date/time library, https://github.com/alexhsamuel/cron (incomplete but
> >>>> basically working), which may provide ideas or useful code. It was intended
> >>>> for precisely this sort of application.
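
To illustrate the trade-off described above, the sketch below splits a tick count into whole seconds and a sub-second remainder in two ways: with divmod for a power-of-ten denominator (10^9, nanoseconds) and with a shift and mask for a power-of-two denominator (2^30, chosen here only as an example). The function names are hypothetical. The power-of-two form avoids division, but a value such as one millisecond can only be approximated.

```python
NANOS_PER_SECOND = 10**9

BINARY_BITS = 30                       # example choice, not from the thread
BINARY_DENOMINATOR = 1 << BINARY_BITS  # ticks per second
BINARY_MASK = BINARY_DENOMINATOR - 1


def split_decimal(ticks: int) -> tuple:
    """Split nanosecond ticks into (seconds, nanoseconds) using divmod."""
    return divmod(ticks, NANOS_PER_SECOND)


def split_binary(ticks: int) -> tuple:
    """Split 1/2**30-second ticks into (seconds, remainder) using bit operations."""
    return ticks >> BINARY_BITS, ticks & BINARY_MASK


def seconds_to_binary_ticks(numerator: int, denominator: int) -> int:
    """Round numerator/denominator seconds to the nearest 1/2**30-second tick."""
    return (numerator * BINARY_DENOMINATOR + denominator // 2) // denominator


# One millisecond is exact with a decimal denominator but not with a binary one.
one_ms_binary = seconds_to_binary_ticks(1, 1000)
one_ms_is_exact = one_ms_binary * 1000 == BINARY_DENOMINATOR
```

Which side of the trade-off matters more depends on whether the application needs exact decimal fractions or fast extraction of the seconds part.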
> >>>>
> >>>> Regards,
> >>>> Alex
> >>>>
> >>>>
> >>>> On Thu, Jul 14, 2016 at 10:27 AM Uwe Korn <uw...@xhochy.com> wrote:
> >>>>
> >>>>> I agree that having a Decimal type for timestamps is a nice
> >>>>> definition. Having your time encoded as seconds or nanoseconds should be
> >>>>> the same as having a scale of the respective amount. But I would rather
> >>>>> avoid having a separate decimal physical type. Therefore I'd prefer the
> >>>>> Parquet approach where decimal is only a logical type and backed by
> >>>>> either a bytearray, int32 or int64.
> >>>>>
> >>>>> Thus a more general timestamp could look like:
> >>>>>
> >>>>> * Decimals are logical types, physical types are the same as defined in
> >>>>> Parquet [1]
> >>>>> * Base unit for timestamps is seconds; you can get milliseconds and
> >>>>> nanoseconds by using a different scale. (Note that these units all
> >>>>> differ from seconds by powers of ten, thus matching the specification
> >>>>> of decimal scale really well.)
> >>>>> * Timestamp is just another logical type that refers to Decimal
> >>>>> (and optionally may have a timezone), signalling that we have a Time
> >>>>> and not just a "simple" decimal.
> >>>>> * For a first iteration, I would assume no timezone or UTC but not
> >>>>> include a metadata field. Once we're sure the implementation works, we
> >>>>> can add metadata about it.
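
A minimal sketch of this layering, assuming the Parquet-style convention that a decimal's value is unscaled * 10^(-scale): with seconds as the base unit, scale 3 means milliseconds and scale 9 means nanoseconds. The class names DecimalType and DecimalTimestampType are hypothetical, and the physical storage is assumed to be a plain int64.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)
class DecimalType:
    """Hypothetical logical decimal: value = unscaled * 10**(-scale)."""
    scale: int

    def to_decimal(self, unscaled: int) -> Decimal:
        return Decimal(unscaled).scaleb(-self.scale)


@dataclass(frozen=True)
class DecimalTimestampType:
    """Hypothetical logical timestamp layered on a decimal; base unit is seconds."""
    decimal: DecimalType

    def rescale(self, unscaled: int, target: "DecimalTimestampType") -> int:
        """Convert a stored integer to another scale (floors when losing precision)."""
        shift = target.decimal.scale - self.decimal.scale
        if shift >= 0:
            return unscaled * 10**shift
        return unscaled // 10**(-shift)


SECONDS = DecimalTimestampType(DecimalType(scale=0))
MILLIS = DecimalTimestampType(DecimalType(scale=3))
NANOS = DecimalTimestampType(DecimalType(scale=9))

# 1500 stored at scale 3 is 1.5 seconds.
as_seconds = MILLIS.decimal.to_decimal(1500)
as_nanos = MILLIS.rescale(1500, NANOS)
```

In this reading, the unit of a timestamp column is nothing more than the decimal scale, so no separate physical type is needed.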
> >>>>>
> >>>>> Timedeltas could be addressed in a similar way, just without the need
> >>>>> for a timezone.
> >>>>>
> >>>>> For my usages, I don't have the use-case for a larger than int64
> >>>>> timestamp and would like to have it exactly as such in my computation,
> >>>>> thus my preference for the Parquet way.
> >>>>>
> >>>>> Uwe
> >>>>>
> >>>>> [1]
> >>>>>
> >>>>> https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#decimal
> >>>>>
> >>>>> On 13.07.16 03:06, Julian Hyde wrote:
> >>>>>> I'm talking about a fixed decimal type, not floating decimal. (Oracle
> >>>>>> numbers are floating decimal. They have a few nice properties, but
> >>>>>> they are variable width and can get quite large. I've seen one or two
> >>>>>> systems that started with binary floating point numbers, which are
> >>>>>> much worse for business computing, and then changed to Java BigDecimal,
> >>>>>> which gives the right answer but is horribly inefficient.)
> >>>>>>
> >>>>>> A fixed decimal type has virtually zero computational overhead. It
> >>>>>> just has a piece of metadata saying something like "every value in
> >>>>>> this field is multiplied by 1 million" and leaves it to the client
> >>>>>> program to do that multiplying.
> >>>>>>
> >>>>>> My advice is to create a good fixed decimal type and lean on it heavily.
> >>>>>>
> >>>>>> Julian
> >>>>>>
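
The following sketch shows why a fixed decimal column is cheap: the stored values are plain integers, arithmetic within the column is ordinary integer arithmetic, and the factor is applied only when a client wants the real number. The FixedDecimalColumn class is hypothetical, and the sketch reads the "multiplied by 1 million" metadata as meaning that the stored integers are the real values times 10^6.

```python
from dataclasses import dataclass
from fractions import Fraction
from typing import List


@dataclass
class FixedDecimalColumn:
    """Hypothetical column: integers plus one piece of metadata, the multiplier."""
    multiplier: int
    values: List[int]

    def total(self) -> int:
        """Summation is plain integer addition; the metadata is not touched."""
        return sum(self.values)

    def to_real(self, stored: int) -> Fraction:
        """Client-side step: undo the multiplier to recover the real value."""
        return Fraction(stored, self.multiplier)


# 0.1 and 0.2 stored with a multiplier of one million.
column = FixedDecimalColumn(multiplier=1_000_000, values=[100_000, 200_000])

exact_sum = column.to_real(column.total())  # exactly 3/10
float_sum_is_exact = (0.1 + 0.2 == 0.3)     # binary floating point misses here
```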
> >>>>>
> >>>>>
> >>
>
>


-- 
Julien

Re: Timestamps with different precision / Timedeltas

Posted by Julian Hyde <jh...@apache.org>.
In SQL, date-time values have no timezone, and they are not implicitly UTC. It is up to the user to supply a timezone. Sounds like what you are proposing is a moment in time (similar to Unix time, and what Joda calls an “instant”). That’s fine, but be aware that you are diverging from SQL.
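
The distinction can be seen with Python's standard library, used here only as an illustration: a naive datetime behaves like a SQL date-time value (no timezone, not implicitly UTC), whereas a timezone-aware datetime identifies an instant. The helper as_instant and the fixed offsets are hypothetical choices for the example.

```python
from datetime import datetime, timedelta, timezone

# A "SQL-style" value: a wall-clock reading with no timezone attached.
wall_clock = datetime(2016, 10, 3, 22, 16, 38)


def as_instant(local: datetime, utc_offset_hours: int) -> datetime:
    """Attach a user-supplied fixed UTC offset, turning the value into an instant."""
    return local.replace(tzinfo=timezone(timedelta(hours=utc_offset_hours)))


# The same wall-clock reading names different instants in different zones.
in_utc = as_instant(wall_clock, 0)
in_utc_minus_7 = as_instant(wall_clock, -7)

gap = in_utc_minus_7 - in_utc            # how far apart the two instants are
epoch_seconds = int(in_utc.timestamp())  # Unix time of the UTC interpretation
```

Only once the user supplies an offset does the value map to a single Unix time, which is the divergence from SQL noted above.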


Re: Timestamps with different precision / Timedeltas

Posted by Julien Le Dem <ju...@dremio.com>.
Here is a PR for the change in timestamp:
https://github.com/apache/arrow/pull/156

We should also clarify Date:
https://issues.apache.org/jira/browse/ARROW-316

-- 
Julien

Re: Timestamps with different precision / Timedeltas

Posted by Julien Le Dem <ju...@dremio.com>.
I created a JIRA for the Timestamp type if you want to comment in it:
https://issues.apache.org/jira/browse/ARROW-315

On Mon, Oct 3, 2016 at 3:16 PM, Julien Le Dem <ju...@dremio.com> wrote:

> consistency with Parquet a +
> Parquet supports timestamp millis and micros (no nanos)
> https://github.com/apache/parquet-format/blob/master/
> LogicalTypes.md#datetime-types
>
> currently Arrow timestamps have a timezone field.
> https://github.com/apache/arrow/blob/master/format/Message.fbs#L67
> Wes: regarding your suggestion do we want to change timestamp as follows?
> - remove "timestamp" field and say it's UTC
> - add unit field (MICROS | MILLIS)
>
>
>
> On Fri, Sep 30, 2016 at 12:20 PM, Donald Foss <do...@gmail.com>
> wrote:
>
>> +1 for nano or milli, or something else?
>>
>> TL;DR;
>>
>> epochMilli++
>>
>> —
>>
>> Wes, the hierarchy is eminently reasonable, so +1 from me for that.
>> Regarding your aside, I am also a fan of the
>> http://speleotrove.com/decimal/decarith.html <
>> http://speleotrove.com/decimal/decarith.html> specification, though I
>> must admit I am biased simply because it addresses the Rexx Lost Digits
>> condition.
>>
>> The most commonly used timestamps I see are stored as epoch milliseconds,
>> or epochMillis.  It may not be canonical, however there are many billions
>> of devices and software applications utilizing it.
>>
>> To support extremely fine grained DateTime representations, particularly
>> in common scientific applications, I’m for _epochNano_, with logical
>> casting to work with existing datasets that are in epochMilli instead.  We
>> can deal with the rollover in 300k years.
>>
>> While I personally would prefer assigning 0 as 2000-01-01T00:00:00.00Z, I
>> doubt it will ever happen. No, I’m not a millennial.
>>
>> My only concern is for use of 64-bit logical DateTime at the small
>> Physics level.  For that use case, UT2 is more appropriate; measurements
>> are frequently in fractions of nanoseconds.  Perhaps there could be a way
>> to logically cast a signed int96, which is supported by Parquet.
>>
>> Timestamp [logical type]
>> extends FixedDecimal [logical type] (int64)
>> extends FixedWidth [physical type] byteArray[8]
>>
>> Timestamp96 [logical type]
>> extends FixedDecimal [logical type] (int96)
>> extends FixedWidth [physical type] byteArray[12]
>>
>> —
>>
>> Although inappurtenant to this specific discussion, I would like to see a
>> standardized DateTime specification that uses a signed int64 as the decimal
>> epochSecond and an unsigned int96 as the fractional representation of a
>> second.
>>
>> TimestampHiggs [logical type]
>> extends FixedDecimal [logical type] [(int64), (uint96)] :: join()ing of 2
>> columns, the fixed decimal epochSecond and the fractional second as
>> (n/2^96).
>> extends FixedWidth [physical type] byteArray[8], byteArray[12]
>>
>> —Donald
>>
>> > On Sep 29, 2016, at 7:07 PM, Jacques Nadeau <ja...@apache.org> wrote:
>> >
>> > +1
>> >
>> > On Thu, Sep 29, 2016 at 3:19 PM, Wes McKinney <we...@gmail.com>
>> wrote:
>> >
>> >> hello,
>> >>
>> >> For the current iteration of Arrow, can we agree to support int64 UNIX
>> >> timestamps with a particular resolution (second through nanosecond),
>> >> as these are reasonably common representations? We can look to expand
>> >> later if it is needed.
>> >>
>> >> Thanks
>> >> Wes
>> >>
>> >> On Mon, Aug 15, 2016 at 4:12 AM, Wes McKinney <we...@gmail.com>
>> wrote:
>> >>> Bumping this discussion. As part of finalizing a v1 Arrow spec (for
>> >>> purposes of moving data between systems, at minimum) we should propose
>> >>> timestamp metadata and physical memory representation that maximizes
>> >>> interoperability with other systems. It seems like a fixed decimal
>> >>> would meet this requirement as UNIX-like timestamps at some resolution
>> >>> could pass unmodified with appropriate metadata.
>> >>>
>> >>> We will also need decimal types in Arrow (at least to accommodate
>> >>> common database representations and file formats like Parquet), so
>> >>> this seems like a reasonable potential hierarchy of types:
>> >>>
>> >>> Timestamp [logical type]
>> >>> extends FixedDecimal [logical type]
>> >>> extends FixedWidth [physical type]
>> >>>
>> >>> I did a bit of internet searching but did not find a canonical
>> >>> reference or implementation of fixed decimals; that would be helpful.
>> >>>
>> >>> As an aside: for floating decimal numbers for numerical data we could
>> >>> utilize an implementation like http://www.bytereef.org/mpdecimal/
>> >>> which implements the spec described at
>> >>> http://speleotrove.com/decimal/decarith.html
>> >>>
>> >>> Thanks
>> >>> Wes
>> >>>
>> >>> On Thu, Jul 14, 2016 at 8:18 AM, Alex Samuel <al...@alexsamuel.net> wrote:
>> >>>> Hi all,
>> >>>>
>> >>>> May I suggest that instead of fixed-point decimals, you consider a more
>> >>>> general fixed-denominator rational representation, for times and other
>> >>>> purposes? Powers of ten are convenient for humans, but powers of two are
>> >>>> more efficient. For some applications, the efficiency of bit operations
>> >>>> over divmod is more useful than an exact representation of integral
>> >>>> nanoseconds.
>> >>>>
>> >>>> std::chrono takes this approach. I'll also humbly point you at my own
>> >>>> date/time library, https://github.com/alexhsamuel/cron (incomplete but
>> >>>> basically working), which may provide ideas or useful code. It was
>> >>>> intended for precisely this sort of application.
>> >>>>
>> >>>> Regards,
>> >>>> Alex
>> >>>>
>> >>>>
>> >>>> On Thu, Jul 14, 2016 at 10:27 AM Uwe Korn <uw...@xhochy.com> wrote:
>> >>>>
>> >>>>> I agree that having a Decimal type for timestamps is a nice
>> >>>>> definition. Having your time encoded as seconds or nanoseconds should
>> >>>>> be the same as having a scale of the respective amount. But I would
>> >>>>> rather avoid having a separate decimal physical type. Therefore I'd
>> >>>>> prefer the Parquet approach where decimal is only a logical type and
>> >>>>> backed by either a bytearray, int32 or int64.
>> >>>>>
>> >>>>> Thus a more general timestamp could look like:
>> >>>>>
>> >>>>> * Decimals are logical types, physical types are the same as defined
>> >>>>> in Parquet [1]
>> >>>>> * Base unit for timestamps is seconds, you can get milliseconds and
>> >>>>> nanoseconds by using a different scale. (Note that the sub-second
>> >>>>> units are all powers of ten of a second, thus matching the
>> >>>>> specification of decimal scale really well.)
>> >>>>> * Timestamp is just another logical type that is referring to Decimal
>> >>>>> (and optionally may have a timezone) and signalling that we have a
>> >>>>> Time and not just a "simple" decimal.
>> >>>>> * For a first iteration, I would assume no timezone or UTC but not
>> >>>>> include a metadata field. Once we're sure the implementation works,
>> >>>>> we can add metadata about it.
>> >>>>>
>> >>>>> Timedeltas could be addressed in a similar way, just without the need
>> >>>>> for a timezone.
>> >>>>>
>> >>>>> For my usages, I don't have the use-case for a larger than int64
>> >>>>> timestamp and would like to have it exactly as such in my computation,
>> >>>>> thus my preference for the Parquet way.
>> >>>>>
>> >>>>> Uwe
>> >>>>>
>> >>>>> [1]
>> >>>>> https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#decimal
>> >>>>>
>> >>>>> On 13.07.16 03:06, Julian Hyde wrote:
>> >>>>>> I'm talking about a fixed decimal type, not floating decimal. (Oracle
>> >>>>>> numbers are floating decimal. They have a few nice properties, but
>> >>>>>> they are variable width and can get quite large. I've seen one or two
>> >>>>>> systems that started with binary floating point numbers, which are
>> >>>>>> much worse for business computing, and then changed to Java
>> >>>>>> BigDecimal, which gives the right answer but is horribly inefficient.)
>> >>>>>>
>> >>>>>> A fixed decimal type has virtually zero computational overhead. It
>> >>>>>> just has a piece of metadata saying something like "every value in
>> >>>>>> this field is multiplied by 1 million" and leaves it to the client
>> >>>>>> program to do that multiplying.
>> >>>>>>
>> >>>>>> My advice is to create a good fixed decimal type and lean on it
>> >>>>>> heavily.
>> >>>>>>
>> >>>>>> Julian
>> >>>>>>
>> >>>>>
>> >>>>>
>> >>
>>
>>
>
>
> --
> Julien
>



-- 
Julien
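
For readers of this thread, the "int64 timestamp plus a unit" idea discussed above can be sketched as follows. This is only an illustrative sketch, not Arrow's or Parquet's actual API: the names TimeUnit, cast_timestamp and range_in_years are hypothetical, and the unit is modeled as a decimal scale (a power of ten of ticks per second), in the spirit of the fixed-decimal description quoted above. It also shows roughly how many years an int64 covers on each side of the epoch at a given resolution, which depends heavily on the unit chosen.

```python
from enum import Enum

INT64_MIN = -(2**63)
INT64_MAX = 2**63 - 1
SECONDS_PER_YEAR = 31_556_952  # mean Gregorian year (365.2425 days)


class TimeUnit(Enum):
    """Hypothetical unit field; the value is the decimal scale (10**value ticks per second)."""
    SECOND = 0
    MILLI = 3
    MICRO = 6
    NANO = 9

    @property
    def ticks_per_second(self):
        return 10 ** self.value


def cast_timestamp(value, src, dst):
    """Logically cast an epoch offset from unit `src` to unit `dst`.

    Going to a finer unit multiplies; going to a coarser unit floors
    (rounds toward negative infinity). Raises OverflowError if the
    result no longer fits in a signed 64-bit integer.
    """
    if dst.value >= src.value:
        result = value * 10 ** (dst.value - src.value)
    else:
        result = value // 10 ** (src.value - dst.value)
    if not INT64_MIN <= result <= INT64_MAX:
        raise OverflowError(f"{value} {src.name} does not fit in int64 as {dst.name}")
    return result


def range_in_years(unit):
    """Whole years representable on each side of the epoch with int64 at this unit."""
    return INT64_MAX // unit.ticks_per_second // SECONDS_PER_YEAR


if __name__ == "__main__":
    millis = 1_475_532_998_000  # an arbitrary epochMilli value
    print(cast_timestamp(millis, TimeUnit.MILLI, TimeUnit.NANO))
    for unit in TimeUnit:
        print(unit.name, range_in_years(unit))
```

Under these assumptions the stored integers pass through unchanged between systems that agree on the unit, and a cast is a single multiplication or floor division by a power of ten.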