Posted to dev@arrow.apache.org by Wes McKinney <we...@gmail.com> on 2016/07/05 23:21:00 UTC

Re: Timestamps with different precision / Timedeltas

Is it worth doing a review of different file formats and database
systems to decide on a timestamp implementation (int64 or int96 with
some resolution seems to be quite popular as well)? At least in the
Arrow C++ codebase, we need to add decimal handling logic anyway.
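
To make the "int64 with some resolution" idea concrete, here is a minimal Python sketch (not Arrow code). It assumes an int64 epoch count plus a unit can be read as a fixed-point decimal number of seconds, and it shows that converting to a coarser unit is where precision can be lost. The names SCALE, to_decimal_seconds and rescale are hypothetical, chosen only for this illustration.

```python
from decimal import Decimal

# Hypothetical unit table: decimal digits after the point when the base unit is seconds.
SCALE = {"s": 0, "ms": 3, "us": 6, "ns": 9}


def to_decimal_seconds(value: int, unit: str) -> Decimal:
    """Read an integer epoch count in `unit` as a fixed-point number of seconds."""
    return Decimal(value).scaleb(-SCALE[unit])


def rescale(value: int, src: str, dst: str) -> int:
    """Convert an integer count between units; refuse to drop non-zero digits."""
    shift = SCALE[dst] - SCALE[src]
    if shift >= 0:
        # Going to a finer unit is always exact.
        return value * 10 ** shift
    quotient, remainder = divmod(value, 10 ** -shift)
    if remainder:
        raise ValueError(f"converting {value} {src} to {dst} would lose precision")
    return quotient


nanos = 1_467_760_860_123_456_789  # an arbitrary nanosecond count since the UNIX epoch
print(to_decimal_seconds(nanos, "ns"))
print(rescale(1_500, "ms", "us"))
```

Raising on a lossy conversion is only one possible policy; a caller could equally choose to truncate or round, which is the question raised elsewhere in this thread.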

On Mon, Jun 27, 2016 at 5:20 PM, Julian Hyde <jh...@apache.org> wrote:
> SQL allows timestamps to be stored with any precision (i.e. number of digits after the decimal point) between 0 and 9. That strongly indicates to me that the right implementation of timestamps is as (fixed point) decimal values.
>
> Then devote your efforts to getting the decimal type working correctly.
>
>
>> On Jun 27, 2016, at 3:16 PM, Wes McKinney <we...@gmail.com> wrote:
>>
>> hi Uwe,
>>
>> Thanks for bringing this up. So far we've largely been skirting the
>> "Logical Types Rabbit Hole", but it would be good to start a document
>> collecting requirements for various logical types (e.g. timestamps) so
>> that we can attempt to achieve good solutions on the first try based
>> on the experiences (good and bad) of other projects.
>>
>> In the IPC flatbuffers metadata spec that we drafted for discussion /
>> prototype implementation earlier this year [1], we do have a Timestamp
>> logical type containing only a timezone optional field [2]. If you
>> contrast this with Feather (which uses Arrow's physical memory layout,
>> but custom metadata to suit Python/R needs), that has both a unit and
>> timezone [3].
>>
>> Since there is little consensus in the units of timestamps (more
>> consensus around the UNIX 1970-01-01 epoch, but not even 100%
>> uniformity), I believe the best route would be to add a unit to the
>> metadata to indicate second through nanosecond resolution. Same goes
>> for a Time type.
>>
>> For example, Parquet has both milliseconds and microseconds (in
>> Parquet 2.0). But earlier versions of Parquet don't have this at all
>> [4]. Other systems like Hive and Impala are relying on their own table
>> metadata to convert back and forth (e.g. embedding timestamps of
>> whatever resolution in int64 or int96).
>>
>> For Python pandas users who want to use Parquet files (via Arrow) in their
>> workflow, we're stuck with a couple of options:
>>
>> 1) Drop sub-microsecond nanos and store timestamps as TIMESTAMP_MICROS
>> (or MILLIS? Not all Parquet readers may be aware of the new
>> microsecond ConvertedType)
>> 2) Store nanosecond timestamps as INT64 and add a bespoke entry to
>> ColumnMetaData::key_value_metadata (it's better than nothing?).
>>
>> I see use cases for both of these -- for Option 1, you may care about
>> interoperability with another system that uses Parquet. For Option 2,
>> you may care about preserving the fidelity of your pandas data.
>> Realistically, #1 seems like the best default option. It makes sense
>> to offer #2 as an option.
>>
>> I don't think addressing time zones in the first pass is strictly
>> necessary, but as long as we store timestamps as UTC, we can also put
>> the time zone in the KeyValue metadata.
>>
>> I'm not sure about the Interval type -- let's create a JIRA and tackle
>> that in a separate discussion. I agree that it merits inclusion as a
>> logical type, but I'm not sure what storage representation makes the
>> most sense (e.g. it is not clear to me why Parquet does not store the
>> interval as an absolute number of milliseconds; perhaps to accommodate
>> month-based intervals which may have different absolute lengths
>> depending on where you start).
>>
>> Let me know what you think, and if others have thoughts I'd be interested too.
>>
>> thanks,
>> Wes
>>
>> [1]: https://github.com/apache/arrow/blob/master/format/Message.fbs
>> [2]: https://github.com/apache/arrow/blob/master/format/Message.fbs#L51
>> [3]: https://github.com/wesm/feather/blob/master/cpp/src/feather/metadata.fbs#L78
>> [4]: https://github.com/apache/parquet-format/blob/parquet-format-2.0.0/src/thrift/parquet.thrift
>>
>> On Tue, Jun 21, 2016 at 1:40 PM, Uwe Korn <uw...@xhochy.com> wrote:
>>> Hello,
>>>
>>> in addition to categoricals, we are also currently missing a conversion from
>>> Timestamps in Pandas/NumPy to Arrow. Currently we only have two (exact)
>>> resolutions for them: DATE for days and TIMESTAMP for milliseconds. As
>>> https://docs.scipy.org/doc/numpy/reference/arrays.datetime.html notes, there
>>> are several more. We do not need to cater for all of them, but at least for some.
>>> Therefore I have the following questions, which I would like to have resolved in
>>> some form before implementing:
>>>
>>> * Do we want to cater for other resolutions?
>>> * If we do not provide, e.g. nanosecond resolution (sadly the default
>>>   in Pandas), do we cast with precision loss to the nearest match? Or
>>>   should we force the user to do it?
>>> * Not so important for me at the moment: Do we want to support time zones?
>>>
>>> My current objective is to have them for Parquet file writing. Sadly this
>>> has the same limitations. So the two main options seem to be
>>>
>>> * "roundtrip will only yield correct timezone and logical type if we
>>>   read with Arrow/Pandas again (as we use "proprietary" metadata to
>>>   encode it)"
>>> * "we restrict ourselves to milliseconds and days as resolution" (for the
>>>   latter option, we need to decide how graceful we want to be in the
>>>   Pandas<->Arrow conversion).
>>>
>>> A further datatype that we do not yet have in Arrow, but partly have in Parquet,
>>> is timedelta (or INTERVAL in Parquet). We probably need to add another logical
>>> type to Arrow to implement it. I'm open to suggestions here, too.
>>>
>>> Also, in the Arrow spec there is TIME, which seems to be the same as TIMESTAMP
>>> (as far as the comments in the C++ code go). Is there maybe some
>>> distinction I'm missing?
>>>
>>> Cheers
>>>
>>> Uwe
>>>
>

Re: Timestamps with different precision / Timedeltas

Posted by Julian Hyde <jh...@apache.org>.

In SQL, date-time values have no timezone, and they are not implicitly UTC. It is up to the user to supply a timezone. Sounds like what you are proposing is a moment in time (similar to Unix time, and what Joda calls an “instant”). That’s fine, but be aware that you are diverging from SQL.
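
The distinction above can be illustrated with a short Python sketch. This is an illustration only, not part of any Arrow or SQL API; the helper name as_instant and the fixed -7 hour offset are assumptions made for the example. The same zone-less wall-clock value maps to different moments in time depending on which timezone the user supplies.

```python
from datetime import datetime, timedelta, timezone


def as_instant(wall_clock: datetime, utc_offset_hours: int) -> datetime:
    """Attach a fixed UTC offset to a zone-less value, producing a moment in time."""
    if wall_clock.tzinfo is not None:
        raise ValueError("expected a value without timezone information")
    return wall_clock.replace(tzinfo=timezone(timedelta(hours=utc_offset_hours)))


# A SQL-style value: just calendar fields, no timezone and no implied UTC.
wall_clock = datetime(2016, 10, 3, 16, 32)

read_as_utc = as_instant(wall_clock, 0)
read_as_minus_7 = as_instant(wall_clock, -7)

# Same wall-clock fields, but two different instants, seven hours apart.
print(read_as_minus_7 - read_as_utc)
```

Storing only an epoch count commits to the second reading (an instant), so the zone-less SQL meaning would have to be carried separately, for example in metadata.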

> On Oct 3, 2016, at 4:32 PM, Julien Le Dem <ju...@dremio.com> wrote:
> 
> Here is a PR for the change in timestamp:
> https://github.com/apache/arrow/pull/156
> 
> We should also clarify Date:
> https://issues.apache.org/jira/browse/ARROW-316
> 
> On Mon, Oct 3, 2016 at 3:23 PM, Julien Le Dem <ju...@dremio.com> wrote:
> 
>> I created a JIRA for the Timestamp type if you want to comment in it:
>> https://issues.apache.org/jira/browse/ARROW-315
>> 
>> On Mon, Oct 3, 2016 at 3:16 PM, Julien Le Dem <ju...@dremio.com> wrote:
>> 
>>> consistency with Parquet a +
>>> Parquet supports timestamp millis and micros (no nanos)
>>> https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#datetime-types
>>> 
>>> currently Arrow timestamps have a timezone field.
>>> https://github.com/apache/arrow/blob/master/format/Message.fbs#L67
>>> Wes: regarding your suggestion do we want to change timestamp as follows?
>>> - remove "timezone" field and say it's UTC
>>> - add unit field (MICROS | MILLIS)
>>> 
>>> 
>>> 
>>> On Fri, Sep 30, 2016 at 12:20 PM, Donald Foss <do...@gmail.com>
>>> wrote:
>>> 
>>>> +1 for nano or milli, or something else?
>>>> 
>>>> TL;DR;
>>>> 
>>>> epochMilli++
>>>> 
>>>> —
>>>> 
>>>> Wes, the hierarchy is eminently reasonable, so +1 from me for that.
>>>> Regarding your aside, I am also a fan of the
>>>> http://speleotrove.com/decimal/decarith.html specification, though I
>>>> must admit I am biased simply because it addresses the Rexx Lost Digits
>>>> condition.
>>>> 
>>>> The most commonly used timestamps I see are stored as epoch
>>>> milliseconds, or epochMillis.  It may not be canonical, however there are
>>>> many billions of devices and software applications utilizing it.
>>>> 
>>>> To support extremely fine grained DateTime representations, particularly
>>>> in common scientific applications, I’m for _epochNano_, with logical
>>>> casting to work with existing datasets that are in epochMilli instead.  We
>>>> can deal with the rollover in 300k years.
>>>> 
>>>> While I personally would prefer assigning 0 as 2000-01-01T00:00:00.00Z,
>>>> I doubt it will ever happen. No, I’m not a millennial.
>>>> 
>>>> My only concern is for use of 64-bit logical DateTime at the small
>>>> Physics level.  For that use case, UT2 is more appropriate; measurements
>>>> are frequently in fractions of nanoseconds.  Perhaps there could be a way
>>>> to logically cast a signed int96, which is supported by Parquet.
>>>> 
>>>> Timestamp [logical type]
>>>> extends FixedDecimal [logical type] (int64)
>>>> extends FixedWidth [physical type] byteArray[8]
>>>> 
>>>> Timestamp96 [logical type]
>>>> extends FixedDecimal [logical type] (int96)
>>>> extends FixedWidth [physical type] byteArray[12]
>>>> 
>>>> —
>>>> 
>>>> Although inappurtenant to this specific discussion, I would like to see
>>>> a standardized DateTime specification that uses a signed int64 as the
>>>> decimal epochSecond and an unsigned int96 as the fractional representation
>>>> of a second.
>>>> 
>>>> TimestampHiggs [logical type]
>>>> extends FixedDecimal [logical type] [(int64), (uint96)] :: join()ing of
>>>> 2 columns, the fixed decimal epochSecond and the fractional second as
>>>> (n/2^96).
>>>> extends FixedWidth [physical type] byteArray[8], byteArray[12]
>>>> 
>>>> —Donald
>>>> 
>>>>> On Sep 29, 2016, at 7:07 PM, Jacques Nadeau <ja...@apache.org> wrote:
>>>>> 
>>>>> +1
>>>>> 
>>>>> On Thu, Sep 29, 2016 at 3:19 PM, Wes McKinney <we...@gmail.com> wrote:
>>>>> 
>>>>>> hello,
>>>>>> 
>>>>>> For the current iteration of Arrow, can we agree to support int64 UNIX
>>>>>> timestamps with a particular resolution (second through nanosecond),
>>>>>> as these are reasonably common representations? We can look to expand
>>>>>> later if it is needed.
>>>>>> 
>>>>>> Thanks
>>>>>> Wes
>>>>>> 
>>>>>> On Mon, Aug 15, 2016 at 4:12 AM, Wes McKinney <we...@gmail.com> wrote:
>>>>>>> Bumping this discussion. As part of finalizing a v1 Arrow spec (for
>>>>>>> purposes of moving data between systems, at minimum) we should propose
>>>>>>> timestamp metadata and physical memory representation that maximizes
>>>>>>> interoperability with other systems. It seems like a fixed decimal
>>>>>>> would meet this requirement as UNIX-like timestamps at some resolution
>>>>>>> could pass unmodified with appropriate metadata.
>>>>>>> 
>>>>>>> We will also need decimal types in Arrow (at least to accommodate
>>>>>>> common database representations and file formats like Parquet), so
>>>>>>> this seems like a reasonable potential hierarchy of types:
>>>>>>> 
>>>>>>> Timestamp [logical type]
>>>>>>> extends FixedDecimal [logical type]
>>>>>>> extends FixedWidth [physical type]
>>>>>>> 
>>>>>>> I did a bit of internet searching but did not find a canonical
>>>>>>> reference or implementation of fixed decimals; that would be helpful.
>>>>>>> 
>>>>>>> As an aside: for floating decimal numbers for numerical data we could
>>>>>>> utilize an implementation like http://www.bytereef.org/mpdecimal/
>>>>>>> which implements the spec described at
>>>>>>> http://speleotrove.com/decimal/decarith.html
>>>>>>> 
>>>>>>> Thanks
>>>>>>> Wes
>>>>>>> 
>>>>>>> On Thu, Jul 14, 2016 at 8:18 AM, Alex Samuel <al...@alexsamuel.net> wrote:
>>>>>>>> Hi all,
>>>>>>>>
>>>>>>>> May I suggest that instead of fixed-point decimals, you consider a more
>>>>>>>> general fixed-denominator rational representation, for times and other
>>>>>>>> purposes? Powers of ten are convenient for humans, but powers of two more
>>>>>>>> efficient. For some applications, the efficiency of bit operations over
>>>>>>>> divmod is more useful than an exact representation of integral nanoseconds.
>>>>>>>>
>>>>>>>> std::chrono takes this approach. I'll also humbly point you at my own
>>>>>>>> date/time library, https://github.com/alexhsamuel/cron (incomplete but
>>>>>>>> basically working), which may provide ideas or useful code. It was intended
>>>>>>>> for precisely this sort of application.
>>>>>>>> 
>>>>>>>> Regards,
>>>>>>>> Alex
>>>>>>>> 
>>>>>>>> 
>>>>>>>> On Thu, Jul 14, 2016 at 10:27 AM Uwe Korn <uw...@xhochy.com> wrote:
>>>>>>>> 
>>>>>>>>> I agree that having a Decimal type for timestamps is a nice
>>>>>>>>> definition. Having your time encoded as seconds or nanoseconds should be
>>>>>>>>> the same as having a scale of the respective amount. But I would rather
>>>>>>>>> avoid having a separate decimal physical type. Therefore I'd prefer the
>>>>>>>>> Parquet approach where decimal is only a logical type and backed by
>>>>>>>>> either a bytearray, int32 or int64.
>>>>>>>>>
>>>>>>>>> Thus a more general timestamp could look like:
>>>>>>>>>
>>>>>>>>> * Decimals are logical types, physical types are the same as defined in
>>>>>>>>> Parquet [1]
>>>>>>>>> * Base unit for timestamps is seconds, you can get milliseconds and
>>>>>>>>> nanoseconds by using a different scale. (Note that seconds and so on
>>>>>>>>> are all powers of ten, thus matching the specification of decimal scale
>>>>>>>>> really well).
>>>>>>>>> * Timestamp is just another logical type that is referring to Decimal
>>>>>>>>> (and optionally may have a timezone) and signalling that we have a Time
>>>>>>>>> and not just a "simple" decimal.
>>>>>>>>> * For a first iteration, I would assume no timezone or UTC but not
>>>>>>>>> include a metadata field. Once we're sure the implementation works, we
>>>>>>>>> can add metadata about it.
>>>>>>>>>
>>>>>>>>> Timedeltas could be addressed in a similar way, just without the need
>>>>>>>>> for a timezone.
>>>>>>>>>
>>>>>>>>> For my usages, I don't have the use-case for a larger than int64
>>>>>>>>> timestamp and would like to have it exactly as such in my computation,
>>>>>>>>> thus my preference for the Parquet way.
>>>>>>>>>
>>>>>>>>> Uwe
>>>>>>>>>
>>>>>>>>> [1]
>>>>>>>>>
>>>>>>>>> https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#decimal
>>>>>>>>> 
>>>>>>>>> On 13.07.16 03:06, Julian Hyde wrote:
>>>>>>>>>> I'm talking about a fixed decimal type, not floating decimal. (Oracle
>>>>>>>>>> numbers are floating decimal. They have a few nice properties, but
>>>>>>>>>> they are variable width and can get quite large. I've seen one or two
>>>>>>>>>> systems that started with binary floating point numbers, which are
>>>>>>>>>> much worse for business computing, and then change to Java BigDecimal,
>>>>>>>>>> which gives the right answer but are horribly inefficient.)
>>>>>>>>>>
>>>>>>>>>> A fixed decimal type has virtually zero computational overhead. It
>>>>>>>>>> just has a piece of metadata saying something like "every value in
>>>>>>>>>> this field is multiplied by 1 million" and leaves it to the client
>>>>>>>>>> program to do that multiplying.
>>>>>>>>>>
>>>>>>>>>> My advice is to create a good fixed decimal type and lean on it heavily.
>>>>>>>>>> 
>>>>>>>>>> Julian
>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>> 
>>>> 
>>>> 
>>> 
>>> 
>>> --
>>> Julien
>>> 
>> 
>> 
>> 
>> --
>> Julien
>> 
> 
> 
> 
> -- 
> Julien

Re: Timestamps with different precision / Timedeltas

Posted by Julien Le Dem <ju...@dremio.com>.

Here is a PR for the change in timestamp:
https://github.com/apache/arrow/pull/156

We should also clarify Date:
 https://issues.apache.org/jira/browse/ARROW-316

On Mon, Oct 3, 2016 at 3:23 PM, Julien Le Dem <ju...@dremio.com> wrote:

> I created a JIRA for the Timestamp type if you want to comment in it:
> https://issues.apache.org/jira/browse/ARROW-315
>
> On Mon, Oct 3, 2016 at 3:16 PM, Julien Le Dem <ju...@dremio.com> wrote:
>
>> consistency with Parquet a +
>> Parquet supports timestamp millis and micros (no nanos)
>> https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#datetime-types
>>
>> currently Arrow timestamps have a timezone field.
>> https://github.com/apache/arrow/blob/master/format/Message.fbs#L67
>> Wes: regarding your suggestion do we want to change timestamp as follows?
>> - remove "timezone" field and say it's UTC
>> - add unit field (MICROS | MILLIS)
>>
>>
>>
>> On Fri, Sep 30, 2016 at 12:20 PM, Donald Foss <do...@gmail.com>
>> wrote:
>>
>>> +1 for nano or milli, or something else?
>>>
>>> TL;DR;
>>>
>>> epochMilli++
>>>
>>> —
>>>
>>> Wes, the hierarchy is eminently reasonable, so +1 from me for that.
>>> Regarding your aside, I am also a fan of the
>>> http://speleotrove.com/decimal/decarith.html specification, though I
>>> must admit I am biased simply because it addresses the Rexx Lost Digits
>>> condition.
>>>
>>> The most commonly used timestamps I see are stored as epoch
>>> milliseconds, or epochMillis.  It may not be canonical, however there are
>>> many billions of devices and software applications utilizing it.
>>>
>>> To support extremely fine grained DateTime representations, particularly
>>> in common scientific applications, I’m for _epochNano_, with logical
>>> casting to work with existing datasets that are in epochMilli instead.  We
>>> can deal with the rollover in 300k years.
>>>
>>> While I personally would prefer assigning 0 as 2000-01-01T00:00:00.00Z,
>>> I doubt it will ever happen. No, I’m not a millennial.
>>>
>>> My only concern is for use of 64-bit logical DateTime at the small
>>> Physics level.  For that use case, UT2 is more appropriate; measurements
>>> are frequently in fractions of nanoseconds.  Perhaps there could be a way
>>> to logically cast a signed int96, which is supported by Parquet.
>>>
>>> Timestamp [logical type]
>>> extends FixedDecimal [logical type] (int64)
>>> extends FixedWidth [physical type] byteArray[8]
>>>
>>> Timestamp96 [logical type]
>>> extends FixedDecimal [logical type] (int96)
>>> extends FixedWidth [physical type] byteArray[12]
>>>
>>> —
>>>
>>> Although inappurtenant to this specific discussion, I would like to see
>>> a standardized DateTime specification that uses a signed int64 as the
>>> decimal epochSecond and an unsigned int96 as the fractional representation
>>> of a second.
>>>
>>> TimestampHiggs [logical type]
>>> extends FixedDecimal [logical type] [(int64), (uint96)] :: join()ing of
>>> 2 columns, the fixed decimal epochSecond and the fractional second as
>>> (n/2^96).
>>> extends FixedWidth [physical type] byteArray[8], byteArray[12]
>>>
>>> —Donald
>>>
>>> > On Sep 29, 2016, at 7:07 PM, Jacques Nadeau <ja...@apache.org> wrote:
>>> >
>>> > +1
>>> >
>>> > On Thu, Sep 29, 2016 at 3:19 PM, Wes McKinney <we...@gmail.com> wrote:
>>> >
>>> >> hello,
>>> >>
>>> >> For the current iteration of Arrow, can we agree to support int64 UNIX
>>> >> timestamps with a particular resolution (second through nanosecond),
>>> >> as these are reasonably common representations? We can look to expand
>>> >> later if it is needed.
>>> >>
>>> >> Thanks
>>> >> Wes
>>> >>
>>> >> On Mon, Aug 15, 2016 at 4:12 AM, Wes McKinney <we...@gmail.com> wrote:
>>> >>> Bumping this discussion. As part of finalizing a v1 Arrow spec (for
>>> >>> purposes of moving data between systems, at minimum) we should propose
>>> >>> timestamp metadata and physical memory representation that maximizes
>>> >>> interoperability with other systems. It seems like a fixed decimal
>>> >>> would meet this requirement as UNIX-like timestamps at some resolution
>>> >>> could pass unmodified with appropriate metadata.
>>> >>>
>>> >>> We will also need decimal types in Arrow (at least to accommodate
>>> >>> common database representations and file formats like Parquet), so
>>> >>> this seems like a reasonable potential hierarchy of types:
>>> >>>
>>> >>> Timestamp [logical type]
>>> >>> extends FixedDecimal [logical type]
>>> >>> extends FixedWidth [physical type]
>>> >>>
>>> >>> I did a bit of internet searching but did not find a canonical
>>> >>> reference or implementation of fixed decimals; that would be helpful.
>>> >>>
>>> >>> As an aside: for floating decimal numbers for numerical data we could
>>> >>> utilize an implementation like http://www.bytereef.org/mpdecimal/
>>> >>> which implements the spec described at
>>> >>> http://speleotrove.com/decimal/decarith.html
>>> >>>
>>> >>> Thanks
>>> >>> Wes
>>> >>>
>>> >>> On Thu, Jul 14, 2016 at 8:18 AM, Alex Samuel <al...@alexsamuel.net> wrote:
>>> >>>> Hi all,
>>> >>>>
>>> >>>> May I suggest that instead of fixed-point decimals, you consider a more
>>> >>>> general fixed-denominator rational representation, for times and other
>>> >>>> purposes? Powers of ten are convenient for humans, but powers of two more
>>> >>>> efficient. For some applications, the efficiency of bit operations over
>>> >>>> divmod is more useful than an exact representation of integral nanoseconds.
>>> >>>>
>>> >>>> std::chrono takes this approach. I'll also humbly point you at my own
>>> >>>> date/time library, https://github.com/alexhsamuel/cron (incomplete but
>>> >>>> basically working), which may provide ideas or useful code. It was intended
>>> >>>> for precisely this sort of application.
>>> >>>>
>>> >>>> Regards,
>>> >>>> Alex
>>> >>>>
>>> >>>>
>>> >>>> On Thu, Jul 14, 2016 at 10:27 AM Uwe Korn <uw...@xhochy.com> wrote:
>>> >>>>
>>> >>>>> I agree that having a Decimal type for timestamps is a nice
>>> >>>>> definition. Having your time encoded as seconds or nanoseconds should be
>>> >>>>> the same as having a scale of the respective amount. But I would rather
>>> >>>>> avoid having a separate decimal physical type. Therefore I'd prefer the
>>> >>>>> Parquet approach where decimal is only a logical type and backed by
>>> >>>>> either a bytearray, int32 or int64.
>>> >>>>>
>>> >>>>> Thus a more general timestamp could look like:
>>> >>>>>
>>> >>>>> * Decimals are logical types, physical types are the same as defined in
>>> >>>>> Parquet [1]
>>> >>>>> * Base unit for timestamps is seconds, you can get milliseconds and
>>> >>>>> nanoseconds by using a different scale. (Note that seconds and so on
>>> >>>>> are all powers of ten, thus matching the specification of decimal scale
>>> >>>>> really well).
>>> >>>>> * Timestamp is just another logical type that is referring to Decimal
>>> >>>>> (and optionally may have a timezone) and signalling that we have a Time
>>> >>>>> and not just a "simple" decimal.
>>> >>>>> * For a first iteration, I would assume no timezone or UTC but not
>>> >>>>> include a metadata field. Once we're sure the implementation works, we
>>> >>>>> can add metadata about it.
>>> >>>>>
>>> >>>>> Timedeltas could be addressed in a similar way, just without the need
>>> >>>>> for a timezone.
>>> >>>>>
>>> >>>>> For my usages, I don't have the use-case for a larger than int64
>>> >>>>> timestamp and would like to have it exactly as such in my computation,
>>> >>>>> thus my preference for the Parquet way.
>>> >>>>>
>>> >>>>> Uwe
>>> >>>>>
>>> >>>>> [1]
>>> >>>>>
>>> >>>>> https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#decimal
>>> >>>>>
>>> >>>>> On 13.07.16 03:06, Julian Hyde wrote:
>>> >>>>>> I'm talking about a fixed decimal type, not floating decimal. (Oracle
>>> >>>>>> numbers are floating decimal. They have a few nice properties, but
>>> >>>>>> they are variable width and can get quite large. I've seen one or two
>>> >>>>>> systems that started with binary floating point numbers, which are
>>> >>>>>> much worse for business computing, and then change to Java BigDecimal,
>>> >>>>>> which gives the right answer but are horribly inefficient.)
>>> >>>>>>
>>> >>>>>> A fixed decimal type has virtually zero computational overhead. It
>>> >>>>>> just has a piece of metadata saying something like "every value in
>>> >>>>>> this field is multiplied by 1 million" and leaves it to the client
>>> >>>>>> program to do that multiplying.
>>> >>>>>>
>>> >>>>>> My advice is to create a good fixed decimal type and lean on it heavily.
>>> >>>>>>
>>> >>>>>> Julian
>>> >>>>>>
>>> >>>>>
>>> >>>>>
>>> >>
>>>
>>>
>>
>>
>> --
>> Julien
>>
>
>
>
> --
> Julien
>



-- 
Julien

Re: Timestamps with different precision / Timedeltas

Posted by Julien Le Dem <ju...@dremio.com>.

I created a JIRA for the Timestamp type if you want to comment in it:
https://issues.apache.org/jira/browse/ARROW-315

On Mon, Oct 3, 2016 at 3:16 PM, Julien Le Dem <ju...@dremio.com> wrote:

> consistency with Parquet a +
> Parquet supports timestamp millis and micros (no nanos)
> https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#datetime-types
>
> currently Arrow timestamps have a timezone field.
> https://github.com/apache/arrow/blob/master/format/Message.fbs#L67
> Wes: regarding your suggestion do we want to change timestamp as follows?
> - remove "timezone" field and say it's UTC
> - add unit field (MICROS | MILLIS)
>
>
>
> On Fri, Sep 30, 2016 at 12:20 PM, Donald Foss <do...@gmail.com>
> wrote:
>
>> +1 for nano or milli, or something else?
>>
>> TL;DR;
>>
>> epochMilli++
>>
>> —
>>
>> Wes, the hierarchy is eminently reasonable, so +1 from me for that.
>> Regarding your aside, I am also a fan of the
>> http://speleotrove.com/decimal/decarith.html specification, though I
>> must admit I am biased simply because it addresses the Rexx Lost Digits
>> condition.
>>
>> The most commonly used timestamps I see are stored as epoch milliseconds,
>> or epochMillis.  It may not be canonical, however there are many billions
>> of devices and software applications utilizing it.
>>
>> To support extremely fine grained DateTime representations, particularly
>> in common scientific applications, I’m for _epochNano_, with logical
>> casting to work with existing datasets that are in epochMilli instead.  We
>> can deal with the rollover in 300k years.
>>
>> While I personally would prefer assigning 0 as 2000-01-01T00:00:00.00Z, I
>> doubt it will ever happen. No, I’m not a millennial.
>>
>> My only concern is for use of 64-bit logical DateTime at the small
>> Physics level.  For that use case, UT2 is more appropriate; measurements
>> are frequently in fractions of nanoseconds.  Perhaps there could be a way
>> to logically cast a signed int96, which is supported by Parquet.
>>
>> Timestamp [logical type]
>> extends FixedDecimal [logical type] (int64)
>> extends FixedWidth [physical type] byteArray[8]
>>
>> Timestamp96 [logical type]
>> extends FixedDecimal [logical type] (int96)
>> extends FixedWidth [physical type] byteArray[12]
>>
>> —
>>
>> Although inappurtenant to this specific discussion, I would like to see a
>> standardized DateTime specification that uses a signed int64 as the decimal
>> epochSecond and an unsigned int96 as the fractional representation of a
>> second.
>>
>> TimestampHiggs [logical type]
>> extends FixedDecimal [logical type] [(int64), (uint96)] :: join()ing of 2
>> columns, the fixed decimal epochSecond and the fractional second as
>> (n/2^96).
>> extends FixedWidth [physical type] byteArray[8], byteArray[12]
>>
>> —Donald
>>
>> > On Sep 29, 2016, at 7:07 PM, Jacques Nadeau <ja...@apache.org> wrote:
>> >
>> > +1
>> >
>> > On Thu, Sep 29, 2016 at 3:19 PM, Wes McKinney <we...@gmail.com> wrote:
>> >
>> >> hello,
>> >>
>> >> For the current iteration of Arrow, can we agree to support int64 UNIX
>> >> timestamps with a particular resolution (second through nanosecond),
>> >> as these are reasonably common representations? We can look to expand
>> >> later if it is needed.
>> >>
>> >> Thanks
>> >> Wes
>> >>
>> >> On Mon, Aug 15, 2016 at 4:12 AM, Wes McKinney <we...@gmail.com> wrote:
>> >>> Bumping this discussion. As part of finalizing a v1 Arrow spec (for
>> >>> purposes of moving data between systems, at minimum) we should propose
>> >>> timestamp metadata and physical memory representation that maximizes
>> >>> interoperability with other systems. It seems like a fixed decimal
>> >>> would meet this requirement as UNIX-like timestamps at some resolution
>> >>> could pass unmodified with appropriate metadata.
>> >>>
>> >>> We will also need decimal types in Arrow (at least to accommodate
>> >>> common database representations and file formats like Parquet), so
>> >>> this seems like a reasonable potential hierarchy of types:
>> >>>
>> >>> Timestamp [logical type]
>> >>> extends FixedDecimal [logical type]
>> >>> extends FixedWidth [physical type]
>> >>>
>> >>> I did a bit of internet searching but did not find a canonical
>> >>> reference or implementation of fixed decimals; that would be helpful.
>> >>>
>> >>> As an aside: for floating decimal numbers for numerical data we could
>> >>> utilize an implementation like http://www.bytereef.org/mpdecimal/
>> >>> which implements the spec described at
>> >>> http://speleotrove.com/decimal/decarith.html
>> >>>
>> >>> Thanks
>> >>> Wes
>> >>>
>> >>> On Thu, Jul 14, 2016 at 8:18 AM, Alex Samuel <al...@alexsamuel.net> wrote:
>> >>>> Hi all,
>> >>>>
>> >>>> May I suggest that instead of fixed-point decimals, you consider a more
>> >>>> general fixed-denominator rational representation, for times and other
>> >>>> purposes? Powers of ten are convenient for humans, but powers of two more
>> >>>> efficient. For some applications, the efficiency of bit operations over
>> >>>> divmod is more useful than an exact representation of integral nanoseconds.
>> >>>>
>> >>>> std::chrono takes this approach. I'll also humbly point you at my own
>> >>>> date/time library, https://github.com/alexhsamuel/cron (incomplete but
>> >>>> basically working), which may provide ideas or useful code. It was intended
>> >>>> for precisely this sort of application.
>> >>>>
>> >>>> Regards,
>> >>>> Alex
>> >>>>
>> >>>>
>> >>>> On Thu, Jul 14, 2016 at 10:27 AM Uwe Korn <uw...@xhochy.com> wrote:
>> >>>>
>> >>>>> I agree with that having a Decimal type for timestamps is a nice
>> >>>>> definition. Haying your time encoded as seconds or nanoseconds
>> should
>> >> be
>> >>>>> the same as having a scale of the respective amount. But I would
>> rather
>> >>>>> avoid having a separate decimal physical type. Therefore I'd prefer
>> the
>> >>>>> parquet approach where decimal is only a logical type and backed by
>> >>>>> either a bytearray, int32 or int64.
>> >>>>>
>> >>>>> Thus a more general timestamp could look like:
>> >>>>>
>> >>>>> * Decimals are logical types, physical types are the same as
>> defined in
>> >>>>> Parquet [1]
>> >>>>> * Base unit for timestamps is seconds, you can get milliseconds and
>> >>>>> nanoseconds by using a different scale. (Note that seconds and so on
>> >>>>> are all powers of ten, thus matching the specification of decimal scale
>> >>>>> really well.)
>> >>>>> * Timestamp is just another logical type that is referring to
>> Decimal
>> >>>>> (and optionally may have a timezone) and signalling that we have a
>> Time
>> >>>>> and not just a "simple" decimal.
>> >>>>> * For a first iteration, I would assume no timezone or UTC but not
>> >>>>> include a metadata field. Once we're sure the implementation works,
>> we
>> >>>>> can add metadata about it.
>> >>>>>
>> >>>>> Timedeltas could be addressed in a similar way, just without the
>> need
>> >>>>> for a timezone.
>> >>>>>
>> >>>>> For my usages, I don't have the use-case for a larger than int64
>> >>>>> timestamp and would like to have it exactly as such in my
>> computation,
>> >>>>> thus my preference for the Parquet way.
>> >>>>>
>> >>>>> Uwe
>> >>>>>
>> >>>>> [1]
>> >>>>>
>> >>>>> https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#decimal
>> >>>>>
>> >>>>> On 13.07.16 03:06, Julian Hyde wrote:
>> >>>>>> I'm talking about a fixed decimal type, not floating decimal.
>> (Oracle
>> >>>>>> numbers are floating decimal. They have a few nice properties, but
>> >>>>>> they are variable width and can get quite large. I've seen one or
>> two
>> >>>>>> systems that started with binary floating point numbers, which are
>> >>>>>> much worse for business computing, and then change to Java
>> >> BigDecimal,
>> >>>>>> which gives the right answer but are horribly inefficient.)
>> >>>>>>
>> >>>>>> A fixed decimal type has virtually zero computational overhead. It
>> >>>>>> just has a piece of metadata saying something like "every value in
>> >>>>>> this field is multiplied by 1 million" and leaves it to the client
>> >>>>>> program to do that multiplying.
>> >>>>>>
>> >>>>>> My advice is to create a good fixed decimal type and lean on it
>> >> heavily.
>> >>>>>>
>> >>>>>> Julian
>> >>>>>>
>> >>>>>
>> >>>>>
>> >>
>>
>>
>
>
> --
> Julien
>



-- 
Julien

Re: Timestamps with different precision / Timedeltas

Posted by Julien Le Dem <ju...@dremio.com>.

Consistency with Parquet would be a plus.
Parquet supports timestamps in millis and micros (no nanos):
https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#datetime-types

Currently, Arrow timestamps have a timezone field:
https://github.com/apache/arrow/blob/master/format/Message.fbs#L67
Wes: regarding your suggestion, do we want to change timestamp as follows?
- remove the "timezone" field and say it's UTC
- add a unit field (MICROS | MILLIS)


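To make the proposed change concrete, here is a minimal sketch of how a reader might interpret such a column, assuming the value is an int64 tick count since the UNIX epoch and always UTC. The names `TimeUnit` and `to_utc_datetime` are hypothetical and are not taken from the Arrow or Parquet metadata.

```python
from datetime import datetime, timedelta, timezone
from enum import Enum


class TimeUnit(Enum):
    # Hypothetical unit field; each value is the number of ticks per second.
    MILLIS = 1_000
    MICROS = 1_000_000


EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_utc_datetime(value: int, unit: TimeUnit) -> datetime:
    """Interpret an int64 tick count since the UNIX epoch as a UTC datetime."""
    # divmod floors, so negative (pre-1970) values are handled as well.
    seconds, ticks = divmod(value, unit.value)
    micros = ticks * (1_000_000 // unit.value)
    return EPOCH + timedelta(seconds=seconds, microseconds=micros)


print(to_utc_datetime(1_475_193_600_000, TimeUnit.MILLIS))
```

With no timezone field, the only metadata a consumer needs in this sketch is the unit.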

On Fri, Sep 30, 2016 at 12:20 PM, Donald Foss <do...@gmail.com> wrote:

> +1 for nano or milli, or something else?
>
> TL;DR;
>
> epochMilli++
>
> —
>
> Wes, the hierarchy is eminently reasonable, so +1 from me for that.
> Regarding your aside, I am also a fan of the
> http://speleotrove.com/decimal/decarith.html
> specification, though I must admit I am biased simply because it addresses
> the Rexx Lost Digits condition.
>
> The most commonly used timestamps I see are stored as epoch milliseconds,
> or epochMillis.  It may not be canonical, however there are many billions
> of devices and software applications utilizing it.
>
> To support extremely fine grained DateTime representations, particularly
> in common scientific applications, I’m for _epochNano_, with logical
> casting to work with existing datasets that are in epochMilli instead.  We
> can deal with the rollover in 300k years.
>
> While I personally would prefer assigning 0 as 2000-01-01T00:00:00.00Z, I
> doubt it will ever happen. No, I’m not a millennial.
>
> My only concern is for use of 64-bit logical DateTime at the small Physics
> level.  For that use case, UT2 is more appropriate; measurements are
> frequently in fractions of nanoseconds.  Perhaps there could be a way to
> logically cast a signed int96, which is supported by Parquet.
>
> Timestamp [logical type]
> extends FixedDecimal [logical type] (int64)
> extends FixedWidth [physical type] byteArray[8]
>
> Timestamp96 [logical type]
> extends FixedDecimal [logical type] (int96)
> extends FixedWidth [physical type] byteArray[12]
>
> —
>
> Although inappurtenant to this specific discussion, I would like to see a
> standardized DateTime specification that uses a signed int64 as the decimal
> epochSecond and an unsigned int96 as the fractional representation of a
> second.
>
> TimestampHiggs [logical type]
> extends FixedDecimal [logical type] [(int64), (uint96)] :: join()ing of 2
> columns, the fixed decimal epochSecond and the fractional second as
> (n/2^96).
> extends FixedWidth [physical type] byteArray[8], byteArray[12]
>
> —Donald
>
> > On Sep 29, 2016, at 7:07 PM, Jacques Nadeau <ja...@apache.org> wrote:
> >
> > +1
> >
> > On Thu, Sep 29, 2016 at 3:19 PM, Wes McKinney <we...@gmail.com>
> wrote:
> >
> >> hello,
> >>
> >> For the current iteration of Arrow, can we agree to support int64 UNIX
> >> timestamps with a particular resolution (second through nanosecond),
> >> as these are reasonably common representations? We can look to expand
> >> later if it is needed.
> >>
> >> Thanks
> >> Wes
> >>
> >> On Mon, Aug 15, 2016 at 4:12 AM, Wes McKinney <we...@gmail.com>
> wrote:
> >>> Bumping this discussion. As part of finalizing a v1 Arrow spec (for
> >>> purposes of moving data between systems, at minimum) we should propose
> >>> timestamp metadata and physical memory representation that maximizes
> >>> interoperability with other systems. It seems like a fixed decimal
> >>> would meet this requirement as UNIX-like timestamps at some resolution
> >>> could pass unmodified with appropriate metadata.
> >>>
> >>> We will also need decimal types in Arrow (at least to accommodate
> >>> common database representations and file formats like Parquet), so
> >>> this seems like a reasonable potential hierarchy of types:
> >>>
> >>> Timestamp [logical type]
> >>> extends FixedDecimal [logical type]
> >>> extends FixedWidth [physical type]
> >>>
> >>> I did a bit of internet searching but did not find a canonical
> >>> reference or implementation of fixed decimals; that would be helpful.
> >>>
> >>> As an aside: for floating decimal numbers for numerical data we could
> >>> utilize an implementation like http://www.bytereef.org/mpdecimal/
> >>> which implements the spec described at
> >>> http://speleotrove.com/decimal/decarith.html
> >>>
> >>> Thanks
> >>> Wes
> >>>
> >>> On Thu, Jul 14, 2016 at 8:18 AM, Alex Samuel <al...@alexsamuel.net>
> >> wrote:
> >>>> Hi all,
> >>>>
> >>>> May I suggest that instead of fixed-point decimals, you consider a
> more
> >>>> general fixed-denominator rational representation, for times and other
> >>>> purposes? Powers of ten are convenient for humans, but powers of two
> >> more
> >>>> efficient. For some applications, the efficiency of bit operations
> over
> >>>> divmod is more useful than an exact representation of integral
> >> nanoseconds.
> >>>>
> >>>> std::chrono takes this approach. I'll also humbly point you at my own
> >>>> date/time library, https://github.com/alexhsamuel/cron (incomplete
> but
> >>>> basically working), which may provide ideas or useful code. It was
> >> intended
> >>>> for precisely this sort of application.
> >>>>
> >>>> Regards,
> >>>> Alex
> >>>>
> >>>>
> >>>> On Thu, Jul 14, 2016 at 10:27 AM Uwe Korn <uw...@xhochy.com> wrote:
> >>>>
> >>>>> I agree that having a Decimal type for timestamps is a nice
> >>>>> definition. Having your time encoded as seconds or nanoseconds should be
> >>>>> the same as having a scale of the respective amount. But I would
> rather
> >>>>> avoid having a separate decimal physical type. Therefore I'd prefer
> the
> >>>>> parquet approach where decimal is only a logical type and backed by
> >>>>> either a bytearray, int32 or int64.
> >>>>>
> >>>>> Thus a more general timestamp could look like:
> >>>>>
> >>>>> * Decimals are logical types, physical types are the same as defined
> in
> >>>>> Parquet [1]
> >>>>> * Base unit for timestamps is seconds, you can get milliseconds and
> >>>>> nanoseconds by using a different scale. (Note that seconds and so on
> >>>>> are all powers of ten, thus matching the specification of decimal scale
> >>>>> really well.)
> >>>>> * Timestamp is just another logical type that is referring to Decimal
> >>>>> (and optionally may have a timezone) and signalling that we have a
> Time
> >>>>> and not just a "simple" decimal.
> >>>>> * For a first iteration, I would assume no timezone or UTC but not
> >>>>> include a metadata field. Once we're sure the implementation works,
> we
> >>>>> can add metadata about it.
> >>>>>
> >>>>> Timedeltas could be addressed in a similar way, just without the need
> >>>>> for a timezone.
> >>>>>
> >>>>> For my usages, I don't have the use-case for a larger than int64
> >>>>> timestamp and would like to have it exactly as such in my
> computation,
> >>>>> thus my preference for the Parquet way.
> >>>>>
> >>>>> Uwe
> >>>>>
> >>>>> [1]
> >>>>>
> >>>>> https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#decimal
> >>>>>
> >>>>> On 13.07.16 03:06, Julian Hyde wrote:
> >>>>>> I'm talking about a fixed decimal type, not floating decimal.
> (Oracle
> >>>>>> numbers are floating decimal. They have a few nice properties, but
> >>>>>> they are variable width and can get quite large. I've seen one or
> two
> >>>>>> systems that started with binary floating point numbers, which are
> >>>>>> much worse for business computing, and then change to Java
> >> BigDecimal,
> >>>>>> which gives the right answer but are horribly inefficient.)
> >>>>>>
> >>>>>> A fixed decimal type has virtually zero computational overhead. It
> >>>>>> just has a piece of metadata saying something like "every value in
> >>>>>> this field is multiplied by 1 million" and leaves it to the client
> >>>>>> program to do that multiplying.
> >>>>>>
> >>>>>> My advice is to create a good fixed decimal type and lean on it
> >> heavily.
> >>>>>>
> >>>>>> Julian
> >>>>>>
> >>>>>
> >>>>>
> >>
>
>


-- 
Julien

Re: Timestamps with different precision / Timedeltas

Posted by Donald Foss <do...@gmail.com>.

+1 for nano or milli, or something else? 

TL;DR;

epochMilli++

—

Wes, the hierarchy is eminently reasonable, so +1 from me for that.  Regarding your aside, I am also a fan of the http://speleotrove.com/decimal/decarith.html specification, though I must admit I am biased simply because it addresses the Rexx Lost Digits condition.

The most commonly used timestamps I see are stored as epoch milliseconds, or epochMillis.  It may not be canonical; however, there are many billions of devices and software applications utilizing it.

To support extremely fine-grained DateTime representations, particularly in common scientific applications, I’m for _epochNano_, with logical casting to work with existing datasets that are in epochMilli instead.  We can deal with the rollover in 300k years.

While I personally would prefer assigning 0 as 2000-01-01T00:00:00.00Z, I doubt it will ever happen. No, I’m not a millennial.

My only concern is for use of 64-bit logical DateTime at the small Physics level.  For that use case, UT2 is more appropriate; measurements are frequently in fractions of nanoseconds.  Perhaps there could be a way to logically cast a signed int96, which is supported by Parquet.

Timestamp [logical type]
extends FixedDecimal [logical type] (int64)
extends FixedWidth [physical type] byteArray[8]

Timestamp96 [logical type]
extends FixedDecimal [logical type] (int96)
extends FixedWidth [physical type] byteArray[12]

—

Although inappurtenant to this specific discussion, I would like to see a standardized DateTime specification that uses a signed int64 as the decimal epochSecond and an unsigned int96 as the fractional representation of a second.

TimestampHiggs [logical type]
extends FixedDecimal [logical type] [(int64), (uint96)] :: join()ing of 2 columns, the fixed decimal epochSecond and the fractional second as (n/2^96).
extends FixedWidth [physical type] byteArray[8], byteArray[12]

—Donald
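
As a quick check on the rollover horizon for each resolution, the arithmetic can be sketched as follows, assuming a signed int64 tick count and a mean Gregorian year of 365.2425 days. The names `TICKS_PER_SECOND` and `years_representable` are illustrative only. By this arithmetic, a figure of roughly 300k years corresponds to microsecond ticks, while nanosecond ticks reach roughly 292 years on either side of the epoch.

```python
INT64_MAX = 2**63 - 1
SECONDS_PER_YEAR = 365.2425 * 86_400  # mean Gregorian year (assumption)

# Hypothetical unit labels mapped to ticks per second.
TICKS_PER_SECOND = {
    "second": 1,
    "millisecond": 10**3,
    "microsecond": 10**6,
    "nanosecond": 10**9,
}


def years_representable(ticks_per_second: int) -> float:
    """Years on either side of the epoch reachable by a signed int64 tick count."""
    return INT64_MAX / ticks_per_second / SECONDS_PER_YEAR


for unit, ticks in TICKS_PER_SECOND.items():
    years = years_representable(ticks)
    print(f"{unit:>12}: about {years:,.0f} years each side of the epoch")
```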

> On Sep 29, 2016, at 7:07 PM, Jacques Nadeau <ja...@apache.org> wrote:
> 
> +1
> 
> On Thu, Sep 29, 2016 at 3:19 PM, Wes McKinney <we...@gmail.com> wrote:
> 
>> hello,
>> 
>> For the current iteration of Arrow, can we agree to support int64 UNIX
>> timestamps with a particular resolution (second through nanosecond),
>> as these are reasonably common representations? We can look to expand
>> later if it is needed.
>> 
>> Thanks
>> Wes
>> 
>> On Mon, Aug 15, 2016 at 4:12 AM, Wes McKinney <we...@gmail.com> wrote:
>>> Bumping this discussion. As part of finalizing a v1 Arrow spec (for
>>> purposes of moving data between systems, at minimum) we should propose
>>> timestamp metadata and physical memory representation that maximizes
>>> interoperability with other systems. It seems like a fixed decimal
>>> would meet this requirement as UNIX-like timestamps at some resolution
>>> could pass unmodified with appropriate metadata.
>>> 
>>> We will also need decimal types in Arrow (at least to accommodate
>>> common database representations and file formats like Parquet), so
>>> this seems like a reasonable potential hierarchy of types:
>>> 
>>> Timestamp [logical type]
>>> extends FixedDecimal [logical type]
>>> extends FixedWidth [physical type]
>>> 
>>> I did a bit of internet searching but did not find a canonical
>>> reference or implementation of fixed decimals; that would be helpful.
>>> 
>>> As an aside: for floating decimal numbers for numerical data we could
>>> utilize an implementation like http://www.bytereef.org/mpdecimal/
>>> which implements the spec described at
>>> http://speleotrove.com/decimal/decarith.html
>>> 
>>> Thanks
>>> Wes
>>> 
>>> On Thu, Jul 14, 2016 at 8:18 AM, Alex Samuel <al...@alexsamuel.net>
>> wrote:
>>>> Hi all,
>>>> 
>>>> May I suggest that instead of fixed-point decimals, you consider a more
>>>> general fixed-denominator rational representation, for times and other
>>>> purposes? Powers of ten are convenient for humans, but powers of two
>> more
>>>> efficient. For some applications, the efficiency of bit operations over
>>>> divmod is more useful than an exact representation of integral
>> nanoseconds.
>>>> 
>>>> std::chrono takes this approach. I'll also humbly point you at my own
>>>> date/time library, https://github.com/alexhsamuel/cron (incomplete but
>>>> basically working), which may provide ideas or useful code. It was
>> intended
>>>> for precisely this sort of application.
>>>> 
>>>> Regards,
>>>> Alex
>>>> 
>>>> 
>>>> On Thu, Jul 14, 2016 at 10:27 AM Uwe Korn <uw...@xhochy.com> wrote:
>>>> 
>>>>> I agree that having a Decimal type for timestamps is a nice
>>>>> definition. Having your time encoded as seconds or nanoseconds should be
>>>>> the same as having a scale of the respective amount. But I would rather
>>>>> avoid having a separate decimal physical type. Therefore I'd prefer the
>>>>> parquet approach where decimal is only a logical type and backed by
>>>>> either a bytearray, int32 or int64.
>>>>> 
>>>>> Thus a more general timestamp could look like:
>>>>> 
>>>>> * Decimals are logical types, physical types are the same as defined in
>>>>> Parquet [1]
>>>>> * Base unit for timestamps is seconds, you can get milliseconds and
>>>>> nanoseconds by using a different scale. (Note that seconds and so on
>>>>> are all powers of ten, thus matching the specification of decimal scale
>>>>> really well.)
>>>>> * Timestamp is just another logical type that is referring to Decimal
>>>>> (and optionally may have a timezone) and signalling that we have a Time
>>>>> and not just a "simple" decimal.
>>>>> * For a first iteration, I would assume no timezone or UTC but not
>>>>> include a metadata field. Once we're sure the implementation works, we
>>>>> can add metadata about it.
>>>>> 
>>>>> Timedeltas could be addressed in a similar way, just without the need
>>>>> for a timezone.
>>>>> 
>>>>> For my usages, I don't have the use-case for a larger than int64
>>>>> timestamp and would like to have it exactly as such in my computation,
>>>>> thus my preference for the Parquet way.
>>>>> 
>>>>> Uwe
>>>>> 
>>>>> [1]
>>>>> 
>>>>> https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#decimal
>>>>> 
>>>>> On 13.07.16 03:06, Julian Hyde wrote:
>>>>>> I'm talking about a fixed decimal type, not floating decimal. (Oracle
>>>>>> numbers are floating decimal. They have a few nice properties, but
>>>>>> they are variable width and can get quite large. I've seen one or two
>>>>>> systems that started with binary floating point numbers, which are
>>>>>> much worse for business computing, and then change to Java
>> BigDecimal,
>>>>>> which gives the right answer but are horribly inefficient.)
>>>>>> 
>>>>>> A fixed decimal type has virtually zero computational overhead. It
>>>>>> just has a piece of metadata saying something like "every value in
>>>>>> this field is multiplied by 1 million" and leaves it to the client
>>>>>> program to do that multiplying.
>>>>>> 
>>>>>> My advice is to create a good fixed decimal type and lean on it
>> heavily.
>>>>>> 
>>>>>> Julian
>>>>>> 
>>>>> 
>>>>> 
>>

Re: Timestamps with different precision / Timedeltas

Posted by Jacques Nadeau <ja...@apache.org>.

+1

On Thu, Sep 29, 2016 at 3:19 PM, Wes McKinney <we...@gmail.com> wrote:

> hello,
>
> For the current iteration of Arrow, can we agree to support int64 UNIX
> timestamps with a particular resolution (second through nanosecond),
> as these are reasonably common representations? We can look to expand
> later if it is needed.
>
> Thanks
> Wes
>
> On Mon, Aug 15, 2016 at 4:12 AM, Wes McKinney <we...@gmail.com> wrote:
> > Bumping this discussion. As part of finalizing a v1 Arrow spec (for
> > purposes of moving data between systems, at minimum) we should propose
> > timestamp metadata and physical memory representation that maximizes
> > interoperability with other systems. It seems like a fixed decimal
> > would meet this requirement as UNIX-like timestamps at some resolution
> > could pass unmodified with appropriate metadata.
> >
> > We will also need decimal types in Arrow (at least to accommodate
> > common database representations and file formats like Parquet), so
> > this seems like a reasonable potential hierarchy of types:
> >
> > Timestamp [logical type]
> > extends FixedDecimal [logical type]
> > extends FixedWidth [physical type]
> >
> > I did a bit of internet searching but did not find a canonical
> > reference or implementation of fixed decimals; that would be helpful.
> >
> > As an aside: for floating decimal numbers for numerical data we could
> > utilize an implementation like http://www.bytereef.org/mpdecimal/
> > which implements the spec described at
> > http://speleotrove.com/decimal/decarith.html
> >
> > Thanks
> > Wes
> >
> > On Thu, Jul 14, 2016 at 8:18 AM, Alex Samuel <al...@alexsamuel.net>
> wrote:
> >> Hi all,
> >>
> >> May I suggest that instead of fixed-point decimals, you consider a more
> >> general fixed-denominator rational representation, for times and other
> >> purposes? Powers of ten are convenient for humans, but powers of two
> more
> >> efficient. For some applications, the efficiency of bit operations over
> >> divmod is more useful than an exact representation of integral
> nanoseconds.
> >>
> >> std::chrono takes this approach. I'll also humbly point you at my own
> >> date/time library, https://github.com/alexhsamuel/cron (incomplete but
> >> basically working), which may provide ideas or useful code. It was
> intended
> >> for precisely this sort of application.
> >>
> >> Regards,
> >> Alex
> >>
> >>
> >> On Thu, Jul 14, 2016 at 10:27 AM Uwe Korn <uw...@xhochy.com> wrote:
> >>
> >>> I agree that having a Decimal type for timestamps is a nice
> >>> definition. Having your time encoded as seconds or nanoseconds should be
> >>> the same as having a scale of the respective amount. But I would rather
> >>> avoid having a separate decimal physical type. Therefore I'd prefer the
> >>> parquet approach where decimal is only a logical type and backed by
> >>> either a bytearray, int32 or int64.
> >>>
> >>> Thus a more general timestamp could look like:
> >>>
> >>> * Decimals are logical types, physical types are the same as defined in
> >>> Parquet [1]
> >>> * Base unit for timestamps is seconds, you can get milliseconds and
> >>> nanoseconds by using a different scale. (Note that seconds and so on
> >>> are all powers of ten, thus matching the specification of decimal scale
> >>> really well.)
> >>> * Timestamp is just another logical type that is referring to Decimal
> >>> (and optionally may have a timezone) and signalling that we have a Time
> >>> and not just a "simple" decimal.
> >>> * For a first iteration, I would assume no timezone or UTC but not
> >>> include a metadata field. Once we're sure the implementation works, we
> >>> can add metadata about it.
> >>>
> >>> Timedeltas could be addressed in a similar way, just without the need
> >>> for a timezone.
> >>>
> >>> For my usages, I don't have the use-case for a larger than int64
> >>> timestamp and would like to have it exactly as such in my computation,
> >>> thus my preference for the Parquet way.
> >>>
> >>> Uwe
> >>>
> >>> [1]
> >>>
> >>> https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#decimal
> >>>
> >>> On 13.07.16 03:06, Julian Hyde wrote:
> >>> > I'm talking about a fixed decimal type, not floating decimal. (Oracle
> >>> > numbers are floating decimal. They have a few nice properties, but
> >>> > they are variable width and can get quite large. I've seen one or two
> >>> > systems that started with binary floating point numbers, which are
> >>> > much worse for business computing, and then change to Java
> BigDecimal,
> >>> > which gives the right answer but are horribly inefficient.)
> >>> >
> >>> > A fixed decimal type has virtually zero computational overhead. It
> >>> > just has a piece of metadata saying something like "every value in
> >>> > this field is multiplied by 1 million" and leaves it to the client
> >>> > program to do that multiplying.
> >>> >
> >>> > My advice is to create a good fixed decimal type and lean on it
> heavily.
> >>> >
> >>> > Julian
> >>> >
> >>>
> >>>
>

Re: Timestamps with different precision / Timedeltas

Posted by Wes McKinney <we...@gmail.com>.

hello,

For the current iteration of Arrow, can we agree to support int64 UNIX
timestamps with a particular resolution (second through nanosecond),
as these are reasonably common representations? We can look to expand
later if it is needed.

Thanks
Wes
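
A minimal sketch of what "int64 UNIX timestamps with a particular resolution" could imply for casting between resolutions, assuming the unit is carried as metadata next to the int64 values. The unit labels and the `cast_timestamp` helper are hypothetical and do not describe an existing Arrow API.

```python
INT64_MIN, INT64_MAX = -(2**63), 2**63 - 1

# Hypothetical unit labels mapped to ticks per second.
TICKS_PER_SECOND = {"s": 1, "ms": 10**3, "us": 10**6, "ns": 10**9}


def cast_timestamp(value: int, from_unit: str, to_unit: str) -> int:
    """Rescale a tick count since the UNIX epoch from one resolution to another."""
    src, dst = TICKS_PER_SECOND[from_unit], TICKS_PER_SECOND[to_unit]
    if dst >= src:
        # Finer resolution: exact, but may no longer fit in 64 bits.
        result = value * (dst // src)
    else:
        # Coarser resolution: truncate by flooring toward negative infinity.
        result = value // (src // dst)
    if not INT64_MIN <= result <= INT64_MAX:
        raise OverflowError(f"{value} {from_unit} does not fit in int64 as {to_unit}")
    return result


print(cast_timestamp(1_475_193_600, "s", "ms"))
```

Casting to a finer unit is lossless but can overflow; casting to a coarser unit always fits but discards the sub-unit part.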

On Mon, Aug 15, 2016 at 4:12 AM, Wes McKinney <we...@gmail.com> wrote:
> Bumping this discussion. As part of finalizing a v1 Arrow spec (for
> purposes of moving data between systems, at minimum) we should propose
> timestamp metadata and physical memory representation that maximizes
> interoperability with other systems. It seems like a fixed decimal
> would meet this requirement as UNIX-like timestamps at some resolution
> could pass unmodified with appropriate metadata.
>
> We will also need decimal types in Arrow (at least to accommodate
> common database representations and file formats like Parquet), so
> this seems like a reasonable potential hierarchy of types:
>
> Timestamp [logical type]
> extends FixedDecimal [logical type]
> extends FixedWidth [physical type]
>
> I did a bit of internet searching but did not find a canonical
> reference or implementation of fixed decimals; that would be helpful.
>
> As an aside: for floating decimal numbers for numerical data we could
> utilize an implementation like http://www.bytereef.org/mpdecimal/
> which implements the spec described at
> http://speleotrove.com/decimal/decarith.html
>
> Thanks
> Wes
>
> On Thu, Jul 14, 2016 at 8:18 AM, Alex Samuel <al...@alexsamuel.net> wrote:
>> Hi all,
>>
>> May I suggest that instead of fixed-point decimals, you consider a more
>> general fixed-denominator rational representation, for times and other
>> purposes? Powers of ten are convenient for humans, but powers of two more
>> efficient. For some applications, the efficiency of bit operations over
>> divmod is more useful than an exact representation of integral nanoseconds.
>>
>> std::chrono takes this approach. I'll also humbly point you at my own
>> date/time library, https://github.com/alexhsamuel/cron (incomplete but
>> basically working), which may provide ideas or useful code. It was intended
>> for precisely this sort of application.
>>
>> Regards,
>> Alex
>>
>>
>> On Thu, Jul 14, 2016 at 10:27 AM Uwe Korn <uw...@xhochy.com> wrote:
>>
>>> I agree that having a Decimal type for timestamps is a nice
>>> definition. Having your time encoded as seconds or nanoseconds should be
>>> the same as having a scale of the respective amount. But I would rather
>>> avoid having a separate decimal physical type. Therefore I'd prefer the
>>> parquet approach where decimal is only a logical type and backed by
>>> either a bytearray, int32 or int64.
>>>
>>> Thus a more general timestamp could look like:
>>>
>>> * Decimals are logical types, physical types are the same as defined in
>>> Parquet [1]
>>> * Base unit for timestamps is seconds, you can get milliseconds and
>>> nanoseconds by using a different scale. (Note that seconds and so on
>>> are all powers of ten, thus matching the specification of decimal scale
>>> really well.)
>>> * Timestamp is just another logical type that is referring to Decimal
>>> (and optionally may have a timezone) and signalling that we have a Time
>>> and not just a "simple" decimal.
>>> * For a first iteration, I would assume no timezone or UTC but not
>>> include a metadata field. Once we're sure the implementation works, we
>>> can add metadata about it.
>>>
>>> Timedeltas could be addressed in a similar way, just without the need
>>> for a timezone.
>>>
>>> For my usages, I don't have the use-case for a larger than int64
>>> timestamp and would like to have it exactly as such in my computation,
>>> thus my preference for the Parquet way.
>>>
>>> Uwe
>>>
>>> [1]
>>>
>>> https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#decimal
>>>
>>> On 13.07.16 03:06, Julian Hyde wrote:
>>> > I'm talking about a fixed decimal type, not floating decimal. (Oracle
>>> > numbers are floating decimal. They have a few nice properties, but
>>> > they are variable width and can get quite large. I've seen one or two
>>> > systems that started with binary floating point numbers, which are
>>> > much worse for business computing, and then change to Java BigDecimal,
>>> > which gives the right answer but are horribly inefficient.)
>>> >
>>> > A fixed decimal type has virtually zero computational overhead. It
>>> > just has a piece of metadata saying something like "every value in
>>> > this field is multiplied by 1 million" and leaves it to the client
>>> > program to do that multiplying.
>>> >
>>> > My advice is to create a good fixed decimal type and lean on it heavily.
>>> >
>>> > Julian
>>> >
>>>
>>>

Re: Timestamps with different precision / Timedeltas

Posted by Wes McKinney <we...@gmail.com>.

Bumping this discussion. As part of finalizing a v1 Arrow spec (for
purposes of moving data between systems, at minimum) we should propose
timestamp metadata and physical memory representation that maximizes
interoperability with other systems. It seems like a fixed decimal
would meet this requirement as UNIX-like timestamps at some resolution
could pass unmodified with appropriate metadata.

We will also need decimal types in Arrow (at least to accommodate
common database representations and file formats like Parquet), so
this seems like a reasonable potential hierarchy of types:

Timestamp [logical type]
extends FixedDecimal [logical type]
extends FixedWidth [physical type]

I did a bit of internet searching but did not find a canonical
reference or implementation of fixed decimals; that would be helpful.

As an aside: for floating decimal numbers for numerical data we could
utilize an implementation like http://www.bytereef.org/mpdecimal/
which implements the spec described at
http://speleotrove.com/decimal/decarith.html

Thanks
Wes
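
The hierarchy above can be sketched roughly as follows, assuming the physical storage is a plain integer and the scale lives in column metadata. The class names mirror the hierarchy, but the fields and the `to_decimal` method are illustrative assumptions, not a proposed API.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)
class FixedDecimal:
    """Logical type: integer storage plus one scale shared by the whole column."""

    scale: int  # number of digits after the decimal point

    def to_decimal(self, unscaled: int) -> Decimal:
        # The stored integer is the real value multiplied by 10**scale.
        return Decimal(unscaled).scaleb(-self.scale)


@dataclass(frozen=True)
class Timestamp(FixedDecimal):
    """Logical type: a FixedDecimal counting seconds since the UNIX epoch."""


# Hypothetical column types at three resolutions.
SECONDS = Timestamp(scale=0)
MILLIS = Timestamp(scale=3)
NANOS = Timestamp(scale=9)

print(MILLIS.to_decimal(1_475_193_600_123))
```

In this sketch the stored integers pass through unmodified; only the scale in the metadata differs between resolutions.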

On Thu, Jul 14, 2016 at 8:18 AM, Alex Samuel <al...@alexsamuel.net> wrote:
> Hi all,
>
> May I suggest that instead of fixed-point decimals, you consider a more
> general fixed-denominator rational representation, for times and other
> purposes? Powers of ten are convenient for humans, but powers of two more
> efficient. For some applications, the efficiency of bit operations over
> divmod is more useful than an exact representation of integral nanoseconds.
>
> std::chrono takes this approach. I'll also humbly point you at my own
> date/time library, https://github.com/alexhsamuel/cron (incomplete but
> basically working), which may provide ideas or useful code. It was intended
> for precisely this sort of application.
>
> Regards,
> Alex
>
>
> On Thu, Jul 14, 2016 at 10:27 AM Uwe Korn <uw...@xhochy.com> wrote:
>
>> I agree that having a Decimal type for timestamps is a nice
>> definition. Having your time encoded as seconds or nanoseconds should be
>> the same as having a scale of the respective amount. But I would rather
>> avoid having a separate decimal physical type. Therefore I'd prefer the
>> parquet approach where decimal is only a logical type and backed by
>> either a bytearray, int32 or int64.
>>
>> Thus a more general timestamp could look like:
>>
>> * Decimals are logical types, physical types are the same as defined in
>> Parquet [1]
>> * Base unit for timestamps is seconds, you can get milliseconds and
>> nanoseconds by using a different scale. (Note that seconds and so on
>> are all powers of ten, thus matching the specification of decimal scale
>> really well).
>> * Timestamp is just another logical type that is referring to Decimal
>> (and optionally may have a timezone) and signalling that we have a Time
>> and not just a "simple" decimal.
>> * For a first iteration, I would assume no timezone or UTC but not
>> include a metadata field. Once we're sure the implementation works, we
>> can add metadata about it.
>>
>> Timedeltas could be addressed in a similar way, just without the need
>> for a timezone.
>>
>> For my usages, I don't have the use-case for a larger than int64
>> timestamp and would like to have it exactly as such in my computation,
>> thus my preference for the Parquet way.
>>
>> Uwe
>>
>> [1]
>>
>> https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#decimal
>>
>> On 13.07.16 03:06, Julian Hyde wrote:
>> > I'm talking about a fixed decimal type, not floating decimal. (Oracle
>> > numbers are floating decimal. They have a few nice properties, but
>> > they are variable width and can get quite large. I've seen one or two
>> > systems that started with binary floating point numbers, which are
>> > much worse for business computing, and then change to Java BigDecimal,
>> > which gives the right answer but are horribly inefficient.)
>> >
>> > A fixed decimal type has virtually zero computational overhead. It
>> > just has a piece of metadata saying something like "every value in
>> > this field is multiplied by 1 million" and leaves it to the client
>> > program to do that multiplying.
>> >
>> > My advice is to create a good fixed decimal type and lean on it heavily.
>> >
>> > Julian
>> >
>>
>>

Re: Timestamps with different precision / Timedeltas

Posted by Alex Samuel <al...@alexsamuel.net>.

Hi all,

May I suggest that instead of fixed-point decimals, you consider a more
general fixed-denominator rational representation, for times and other
purposes? Powers of ten are convenient for humans, but powers of two more
efficient. For some applications, the efficiency of bit operations over
divmod is more useful than an exact representation of integral nanoseconds.
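
As a small illustration of the trade-off described above, the following Python sketch splits a tick count into whole seconds and a sub-second remainder for both kinds of denominator. The constants and function names are hypothetical, and the choice of 2**30 ticks per second is just an example.

```python
from typing import Tuple

# Ticks per second for the two hypothetical representations.
DECIMAL_DENOM = 10**9   # power of ten: 1 tick = 1 nanosecond
BINARY_SHIFT = 30       # power of two: 1 tick = 2**-30 s (about 0.93 ns)
BINARY_MASK = (1 << BINARY_SHIFT) - 1


def split_decimal(ticks: int) -> Tuple[int, int]:
    """Whole seconds and sub-second ticks, using divmod."""
    return divmod(ticks, DECIMAL_DENOM)


def split_binary(ticks: int) -> Tuple[int, int]:
    """Whole seconds and sub-second ticks, using a shift and a mask."""
    return ticks >> BINARY_SHIFT, ticks & BINARY_MASK


# 1.5 seconds in each representation.
dec_ticks = 3 * DECIMAL_DENOM // 2
bin_ticks = 3 << (BINARY_SHIFT - 1)
```

The binary form avoids the division, but one nanosecond is not an integral number of 2**-30 s ticks, which is the loss of exactness mentioned above.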

std::chrono takes this approach. I'll also humbly point you at my own
date/time library, https://github.com/alexhsamuel/cron (incomplete but
basically working), which may provide ideas or useful code. It was intended
for precisely this sort of application.

Regards,
Alex


On Thu, Jul 14, 2016 at 10:27 AM Uwe Korn <uw...@xhochy.com> wrote:

> I agree that having a Decimal type for timestamps is a nice
> definition. Having your time encoded as seconds or nanoseconds should be
> the same as having a scale of the respective amount. But I would rather
> avoid having a separate decimal physical type. Therefore I'd prefer the
> parquet approach where decimal is only a logical type and backed by
> either a bytearray, int32 or int64.
>
> Thus a more general timestamp could look like:
>
> * Decimals are logical types, physical types are the same as defined in
> Parquet [1]
> * Base unit for timestamps is seconds, you can get milliseconds and
> nanoseconds by using a different scale. (Note that seconds and so on
> are all powers of ten, thus matching the specification of decimal scale
> really well).
> * Timestamp is just another logical type that is referring to Decimal
> (and optionally may have a timezone) and signalling that we have a Time
> and not just a "simple" decimal.
> * For a first iteration, I would assume no timezone or UTC but not
> include a metadata field. Once we're sure the implementation works, we
> can add metadata about it.
>
> Timedeltas could be addressed in a similar way, just without the need
> for a timezone.
>
> For my usages, I don't have the use-case for a larger than int64
> timestamp and would like to have it exactly as such in my computation,
> thus my preference for the Parquet way.
>
> Uwe
>
> [1]
>
> https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#decimal
>
> On 13.07.16 03:06, Julian Hyde wrote:
> > I'm talking about a fixed decimal type, not floating decimal. (Oracle
> > numbers are floating decimal. They have a few nice properties, but
> > they are variable width and can get quite large. I've seen one or two
> > systems that started with binary floating point numbers, which are
> > much worse for business computing, and then change to Java BigDecimal,
> > which gives the right answer but are horribly inefficient.)
> >
> > A fixed decimal type has virtually zero computational overhead. It
> > just has a piece of metadata saying something like "every value in
> > this field is multiplied by 1 million" and leaves it to the client
> > program to do that multiplying.
> >
> > My advice is to create a good fixed decimal type and lean on it heavily.
> >
> > Julian
> >
>
>

Re: Timestamps with different precision / Timedeltas

Posted by Uwe Korn <uw...@xhochy.com>.

I agree that having a Decimal type for timestamps is a nice
definition. Having your time encoded as seconds or nanoseconds should be
the same as having a scale of the respective amount. But I would rather 
avoid having a separate decimal physical type. Therefore I'd prefer the 
parquet approach where decimal is only a logical type and backed by 
either a bytearray, int32 or int64.

Thus a more general timestamp could look like:

* Decimals are logical types, physical types are the same as defined in 
Parquet [1]
* Base unit for timestamps is seconds, you can get milliseconds and 
nanoseconds by using a different scale. (Note that seconds and so on
are all powers of ten, thus matching the specification of decimal scale
really well).
* Timestamp is just another logical type that is referring to Decimal 
(and optionally may have a timezone) and signalling that we have a Time 
and not just a "simple" decimal.
* For a first iteration, I would assume no timezone or UTC but not 
include a metadata field. Once we're sure the implementation works, we 
can add metadata about it.
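
A minimal Python sketch of the scale idea in the list above, assuming the timestamp is an unscaled integer backed by int64 and the scale is the number of digits after the decimal point of a value in seconds. The function name rescale and the explicit int64 bounds check are illustrative assumptions, not part of Parquet or Arrow.

```python
INT64_MIN, INT64_MAX = -(2**63), 2**63 - 1


def rescale(unscaled: int, from_scale: int, to_scale: int) -> int:
    """Convert an unscaled decimal value between scales.

    A value is interpreted as unscaled * 10**-scale seconds, so scale 0 is
    seconds, 3 is milliseconds and 9 is nanoseconds.
    """
    if to_scale >= from_scale:
        result = unscaled * 10 ** (to_scale - from_scale)
    else:
        # Lowering the scale drops digits; floor division rounds toward -inf.
        result = unscaled // 10 ** (from_scale - to_scale)
    if not INT64_MIN <= result <= INT64_MAX:
        raise OverflowError("value does not fit in int64 at the target scale")
    return result
```

For example, rescale(1467760860, 0, 3) turns a value in seconds into milliseconds. At scale 9 an int64 covers roughly 292 years on either side of the epoch, which is why the bounds check matters for nanoseconds.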

Timedeltas could be addressed in a similar way, just without the need 
for a timezone.

For my usages, I don't have the use-case for a larger than int64 
timestamp and would like to have it exactly as such in my computation, 
thus my preference for the Parquet way.

Uwe

[1] 
https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#decimal

On 13.07.16 03:06, Julian Hyde wrote:
> I'm talking about a fixed decimal type, not floating decimal. (Oracle
> numbers are floating decimal. They have a few nice properties, but
> they are variable width and can get quite large. I've seen one or two
> systems that started with binary floating point numbers, which are
> much worse for business computing, and then change to Java BigDecimal,
> which gives the right answer but are horribly inefficient.)
>
> A fixed decimal type has virtually zero computational overhead. It
> just has a piece of metadata saying something like "every value in
> this field is multiplied by 1 million" and leaves it to the client
> program to do that multiplying.
>
> My advice is to create a good fixed decimal type and lean on it heavily.
>
> Julian
>
>
> On Tue, Jul 12, 2016 at 5:46 PM, Jacques Nadeau <ja...@apache.org> wrote:
>> Julian has some experience with the Oracle internals where the perfect
>> numeric type solves many problems...  :D
>>
>>
>>
>> On Tue, Jul 12, 2016 at 5:43 PM, Wes McKinney <we...@gmail.com> wrote:
>>
>>> As one data point, none of the systems I work with use decimals for
>>> representing timestamps (UNIX timestamps at some resolution, second /
>>> milli / nano, is the most common), so having decimal as the default
>>> storage class would cause a computational hardship. We may consider
>>> incorporating the Timestamp storage type into the canonical metadata.
>>>
>>> - Wes
>>>
>>> On Tue, Jul 5, 2016 at 4:21 PM, Wes McKinney <we...@gmail.com> wrote:
>>>> Is it worth doing a review of different file formats and database
>>>> systems to decide on a timestamp implementation (int64 or int96 with
>>>> some resolution seems to be quite popular as well)? At least in the
>>>> Arrow C++ codebase, we need to add decimal handling logic anyway.
>>>>
>>>> On Mon, Jun 27, 2016 at 5:20 PM, Julian Hyde <jh...@apache.org> wrote:
>>>>> SQL allows timestamps to be stored with any precision (i.e. number of
>>>>> digits after the decimal point) between 0 and 9. That strongly indicates
>>>>> to me that the right implementation of timestamps is as (fixed point)
>>>>> decimal values.
>>>>>
>>>>> Then devote your efforts to getting the decimal type working correctly.
>>>>>
>>>>>
>>>>>> On Jun 27, 2016, at 3:16 PM, Wes McKinney <we...@gmail.com> wrote:
>>>>>>
>>>>>> hi Uwe,
>>>>>>
>>>>>> Thanks for bringing this up. So far we've largely been skirting the
>>>>>> "Logical Types Rabbit Hole", but it would be good to start a document
>>>>>> collecting requirements for various logical types (e.g. timestamps) so
>>>>>> that we can attempt to achieve good solutions on the first try based
>>>>>> on the experiences (good and bad) of other projects.
>>>>>>
>>>>>> In the IPC flatbuffers metadata spec that we drafted for discussion /
>>>>>> prototype implementation earlier this year [1], we do have a Timestamp
>>>>>> logical type containing only a timezone optional field [2]. If you
>>>>>> contrast this with Feather (which uses Arrow's physical memory layout,
>>>>>> but custom metadata to suit Python/R needs), that has both a unit and
>>>>>> timezone [3].
>>>>>>
>>>>>> Since there is little consensus in the units of timestamps (more
>>>>>> consensus around the UNIX 1970-01-01 epoch, but not even 100%
>>>>>> uniformity), I believe the best route would be to add a unit to the
>>>>>> metadata to indicate second through nanosecond resolution. Same goes
>>>>>> for a Time type.
>>>>>>
>>>>>> For example, Parquet has both milliseconds and microseconds (in
>>>>>> Parquet 2.0). But earlier versions of Parquet don't have this at all
>>>>>> [4]. Other systems like Hive and Impala are relying on their own table
>>>>>> metadata to convert back and forth (e.g. embedding timestamps of
>>>>>> whatever resolution in int64 or int96).
>>>>>>
>>>>>> For Python pandas users who want to use Parquet files (via Arrow) in their
>>>>>> workflow, we're stuck with a couple options:
>>>>>>
>>>>>> 1) Drop sub-microsecond nanos and store timestamps as TIMESTAMP_MICROS
>>>>>> (or MILLIS? Not all Parquet readers may be aware of the new
>>>>>> microsecond ConvertedType)
>>>>>> 2) Store nanosecond timestamps as INT64 and add a bespoke entry to
>>>>>> ColumnMetaData::key_value_metadata (it's better than nothing?).
>>>>>>
>>>>>> I see use cases for both of these -- for Option 1, you may care about
>>>>>> interoperability with another system that uses Parquet. For Option 2,
>>>>>> you may care about preserving the fidelity of your pandas data.
>>>>>> Realistically, #1 seems like the best default option. It makes sense
>>>>>> to offer #2 as an option.
>>>>>>
>>>>>> I don't think addressing time zones in the first pass is strictly
>>>>>> necessary, but as long as we store timestamps as UTC, we can also put
>>>>>> the time zone in the KeyValue metadata.
>>>>>>
>>>>>> I'm not sure about the Interval type -- let's create a JIRA and tackle
>>>>>> that in a separate discussion. I agree that it merits inclusion as a
>>>>>> logical type, but I'm not sure what storage representation makes the
>>>>>> most sense (e.g. it is not clear to me why Parquet does not store the
>>>>>> interval as an absolute number of milliseconds; perhaps to accommodate
>>>>>> month-based intervals which may have different absolute lengths
>>>>>> depending on where you start).
>>>>>>
>>>>>> Let me know what you think, and if others have thoughts I'd be
>>>>>> interested too.
>>>>>>
>>>>>> thanks,
>>>>>> Wes
>>>>>>
>>>>>> [1]: https://github.com/apache/arrow/blob/master/format/Message.fbs
>>>>>> [2]: https://github.com/apache/arrow/blob/master/format/Message.fbs#L51
>>>>>> [3]: https://github.com/wesm/feather/blob/master/cpp/src/feather/metadata.fbs#L78
>>>>>> [4]: https://github.com/apache/parquet-format/blob/parquet-format-2.0.0/src/thrift/parquet.thrift
>>>>>>
>>>>>> On Tue, Jun 21, 2016 at 1:40 PM, Uwe Korn <uw...@xhochy.com> wrote:
>>>>>>> Hello,
>>>>>>>
>>>>>>> in addition to categoricals, we also miss at the moment a conversion
>>>>>>> from Timestamps in Pandas/NumPy to Arrow. Currently we only have two
>>>>>>> (exact) resolutions for them: DATE for days and TIMESTAMP for
>>>>>>> milliseconds. As
>>>>>>> https://docs.scipy.org/doc/numpy/reference/arrays.datetime.html notes
>>>>>>> there are several more. We do not need to cater for all but at least
>>>>>>> some of them. Therefore I have the following questions which I like to
>>>>>>> have solved in some form before implementing:
>>>>>>>
>>>>>>> * Do we want to cater for other resolutions?
>>>>>>> * If we do not provide, e.g. nanosecond resolution (sadly the default
>>>>>>>    in Pandas), do we cast with precision loss to the nearest match? Or
>>>>>>>    should we force the user to do it?
>>>>>>> * Not so important for me at the moment: Do we want to support time
>>>>>>>    zones?
>>>>>>>
>>>>>>> My current objective is to have them for Parquet file writing. Sadly
>>>>>>> this has the same limitations. So the two main options seem to be
>>>>>>>
>>>>>>> * "roundtrip will only yield correct timezone and logical type if we
>>>>>>>    read with Arrow/Pandas again (as we use "proprietary" metadata to
>>>>>>>    encode it)"
>>>>>>> * "we restrict us to milliseconds and days as resolution" (for the
>>>>>>>    latter option, we need to decide how graceful we want to be in the
>>>>>>>    Pandas<->Arrow conversion).
>>>>>>>
>>>>>>> Further datatype we have not yet in Arrow but partly in Parquet is
>>>>>>> timedelta (or INTERVAL in Parquet). Probably we need to add another
>>>>>>> logical type to Arrow to implement them. Open for suggestions here, too.
>>>>>>>
>>>>>>> Also in the Arrow spec there is TIME which seems to be the same as
>>>>>>> TIMESTAMP (as far as the comments in the C++ code goes). Is there maybe
>>>>>>> some distinction I'm missing?
>>>>>>>
>>>>>>> Cheers
>>>>>>>
>>>>>>> Uwe
>>>>>>>

Re: Timestamps with different precision / Timedeltas

Posted by Julian Hyde <jh...@apache.org>.

I'm talking about a fixed decimal type, not floating decimal. (Oracle
numbers are floating decimal. They have a few nice properties, but
they are variable width and can get quite large. I've seen one or two
systems that started with binary floating point numbers, which are
much worse for business computing, and then change to Java BigDecimal,
which gives the right answer but are horribly inefficient.)

A fixed decimal type has virtually zero computational overhead. It
just has a piece of metadata saying something like "every value in
this field is multiplied by 1 million" and leaves it to the client
program to do that multiplying.
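
A small Python sketch of what that could look like on the client side, assuming a column of plain integers plus one piece of metadata (here the hypothetical SCALE = 6, i.e. "multiplied by 1 million"). The names and sample values are made up for illustration.

```python
from decimal import Decimal

# Hypothetical column: values stored as plain ints. The metadata records
# that each stored value is the real value times 10**6.
SCALE = 6
stored = [1_500_000, 250_000, 2_000_001]

# Sums and comparisons work directly on the stored integers.
total = sum(stored)
largest = max(stored)


def to_decimal(raw: int, scale: int = SCALE) -> Decimal:
    """Apply the scale only when a client needs the real value."""
    return Decimal(raw).scaleb(-scale)
```

All the arithmetic stays in integer operations; the scale is only consulted when a value is presented, for example with to_decimal(total).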

My advice is to create a good fixed decimal type and lean on it heavily.

Julian


On Tue, Jul 12, 2016 at 5:46 PM, Jacques Nadeau <ja...@apache.org> wrote:
> Julian has some experience with the Oracle internals where the perfect
> numeric type solves many problems...  :D
>
>
>
> On Tue, Jul 12, 2016 at 5:43 PM, Wes McKinney <we...@gmail.com> wrote:
>
>> As one data point, none of the systems I work with use decimals for
>> representing timestamps (UNIX timestamps at some resolution, second /
>> milli / nano, is the most common), so having decimal as the default
>> storage class would cause a computational hardship. We may consider
>> incorporating the Timestamp storage type into the canonical metadata.
>>
>> - Wes
>>
>> On Tue, Jul 5, 2016 at 4:21 PM, Wes McKinney <we...@gmail.com> wrote:
>> > Is it worth doing a review of different file formats and database
>> > systems to decide on a timestamp implementation (int64 or int96 with
>> > some resolution seems to be quite popular as well)? At least in the
>> > Arrow C++ codebase, we need to add decimal handling logic anyway.
>> >
>> > On Mon, Jun 27, 2016 at 5:20 PM, Julian Hyde <jh...@apache.org> wrote:
>> >> SQL allows timestamps to be stored with any precision (i.e. number of
>> >> digits after the decimal point) between 0 and 9. That strongly indicates
>> >> to me that the right implementation of timestamps is as (fixed point)
>> >> decimal values.
>> >>
>> >> Then devote your efforts to getting the decimal type working correctly.
>> >>
>> >>
>> >>> On Jun 27, 2016, at 3:16 PM, Wes McKinney <we...@gmail.com> wrote:
>> >>>
>> >>> hi Uwe,
>> >>>
>> >>> Thanks for bringing this up. So far we've largely been skirting the
>> >>> "Logical Types Rabbit Hole", but it would be good to start a document
>> >>> collecting requirements for various logical types (e.g. timestamps) so
>> >>> that we can attempt to achieve good solutions on the first try based
>> >>> on the experiences (good and bad) of other projects.
>> >>>
>> >>> In the IPC flatbuffers metadata spec that we drafted for discussion /
>> >>> prototype implementation earlier this year [1], we do have a Timestamp
>> >>> logical type containing only a timezone optional field [2]. If you
>> >>> contrast this with Feather (which uses Arrow's physical memory layout,
>> >>> but custom metadata to suit Python/R needs), that has both a unit and
>> >>> timezone [3].
>> >>>
>> >>> Since there is little consensus in the units of timestamps (more
>> >>> consensus around the UNIX 1970-01-01 epoch, but not even 100%
>> >>> uniformity), I believe the best route would be to add a unit to the
>> >>> metadata to indicate second through nanosecond resolution. Same goes
>> >>> for a Time type.
>> >>>
>> >>> For example, Parquet has both milliseconds and microseconds (in
>> >>> Parquet 2.0). But earlier versions of Parquet don't have this at all
>> >>> [4]. Other systems like Hive and Impala are relying on their own table
>> >>> metadata to convert back and forth (e.g. embedding timestamps of
>> >>> whatever resolution in int64 or int96).
>> >>>
>> >>> For Python pandas users who want to use Parquet files (via Arrow) in their
>> >>> workflow, we're stuck with a couple options:
>> >>>
>> >>> 1) Drop sub-microsecond nanos and store timestamps as TIMESTAMP_MICROS
>> >>> (or MILLIS? Not all Parquet readers may be aware of the new
>> >>> microsecond ConvertedType)
>> >>> 2) Store nanosecond timestamps as INT64 and add a bespoke entry to
>> >>> ColumnMetaData::key_value_metadata (it's better than nothing?).
>> >>>
>> >>> I see use cases for both of these -- for Option 1, you may care about
>> >>> interoperability with another system that uses Parquet. For Option 2,
>> >>> you may care about preserving the fidelity of your pandas data.
>> >>> Realistically, #1 seems like the best default option. It makes sense
>> >>> to offer #2 as an option.
>> >>>
>> >>> I don't think addressing time zones in the first pass is strictly
>> >>> necessary, but as long as we store timestamps as UTC, we can also put
>> >>> the time zone in the KeyValue metadata.
>> >>>
>> >>> I'm not sure about the Interval type -- let's create a JIRA and tackle
>> >>> that in a separate discussion. I agree that it merits inclusion as a
>> >>> logical type, but I'm not sure what storage representation makes the
>> >>> most sense (e.g. it is not clear to me why Parquet does not store the
>> >>> interval as an absolute number of milliseconds; perhaps to accommodate
>> >>> month-based intervals which may have different absolute lengths
>> >>> depending on where you start).
>> >>>
>> >>> Let me know what you think, and if others have thoughts I'd be
>> >>> interested too.
>> >>>
>> >>> thanks,
>> >>> Wes
>> >>>
>> >>> [1]: https://github.com/apache/arrow/blob/master/format/Message.fbs
>> >>> [2]: https://github.com/apache/arrow/blob/master/format/Message.fbs#L51
>> >>> [3]: https://github.com/wesm/feather/blob/master/cpp/src/feather/metadata.fbs#L78
>> >>> [4]: https://github.com/apache/parquet-format/blob/parquet-format-2.0.0/src/thrift/parquet.thrift
>> >>>
>> >>> On Tue, Jun 21, 2016 at 1:40 PM, Uwe Korn <uw...@xhochy.com> wrote:
>> >>>> Hello,
>> >>>>
>> >>>> in addition to categoricals, we also miss at the moment a conversion
>> >>>> from Timestamps in Pandas/NumPy to Arrow. Currently we only have two
>> >>>> (exact) resolutions for them: DATE for days and TIMESTAMP for
>> >>>> milliseconds. As
>> >>>> https://docs.scipy.org/doc/numpy/reference/arrays.datetime.html notes
>> >>>> there are several more. We do not need to cater for all but at least
>> >>>> some of them. Therefore I have the following questions which I like to
>> >>>> have solved in some form before implementing:
>> >>>>
>> >>>> * Do we want to cater for other resolutions?
>> >>>> * If we do not provide, e.g. nanosecond resolution (sadly the default
>> >>>>   in Pandas), do we cast with precision loss to the nearest match? Or
>> >>>>   should we force the user to do it?
>> >>>> * Not so important for me at the moment: Do we want to support time
>> >>>>   zones?
>> >>>>
>> >>>> My current objective is to have them for Parquet file writing. Sadly
>> >>>> this has the same limitations. So the two main options seem to be
>> >>>>
>> >>>> * "roundtrip will only yield correct timezone and logical type if we
>> >>>>   read with Arrow/Pandas again (as we use "proprietary" metadata to
>> >>>>   encode it)"
>> >>>> * "we restrict us to milliseconds and days as resolution" (for the
>> >>>>   latter option, we need to decide how graceful we want to be in the
>> >>>>   Pandas<->Arrow conversion).
>> >>>>
>> >>>> Further datatype we have not yet in Arrow but partly in Parquet is
>> >>>> timedelta (or INTERVAL in Parquet). Probably we need to add another
>> >>>> logical type to Arrow to implement them. Open for suggestions here, too.
>> >>>>
>> >>>> Also in the Arrow spec there is TIME which seems to be the same as
>> >>>> TIMESTAMP (as far as the comments in the C++ code goes). Is there maybe
>> >>>> some distinction I'm missing?
>> >>>>
>> >>>> Cheers
>> >>>>
>> >>>> Uwe
>> >>>>
>> >>
>>

Re: Timestamps with different precision / Timedeltas

Posted by Jacques Nadeau <ja...@apache.org>.

Julian has some experience with the Oracle internals where the perfect
numeric type solves many problems...  :D



On Tue, Jul 12, 2016 at 5:43 PM, Wes McKinney <we...@gmail.com> wrote:

> As one data point, none of the systems I work with use decimals for
> representing timestamps (UNIX timestamps at some resolution, second /
> milli / nano, is the most common), so having decimal as the default
> storage class would cause a computational hardship. We may consider
> incorporating the Timestamp storage type into the canonical metadata.
>
> - Wes
>
> On Tue, Jul 5, 2016 at 4:21 PM, Wes McKinney <we...@gmail.com> wrote:
> > Is it worth doing a review of different file formats and database
> > systems to decide on a timestamp implementation (int64 or int96 with
> > some resolution seems to be quite popular as well)? At least in the
> > Arrow C++ codebase, we need to add decimal handling logic anyway.
> >
> > On Mon, Jun 27, 2016 at 5:20 PM, Julian Hyde <jh...@apache.org> wrote:
> >> SQL allows timestamps to be stored with any precision (i.e. number of
> >> digits after the decimal point) between 0 and 9. That strongly indicates
> >> to me that the right implementation of timestamps is as (fixed point)
> >> decimal values.
> >>
> >> Then devote your efforts to getting the decimal type working correctly.
> >>
> >>
> >>> On Jun 27, 2016, at 3:16 PM, Wes McKinney <we...@gmail.com> wrote:
> >>>
> >>> hi Uwe,
> >>>
> >>> Thanks for bringing this up. So far we've largely been skirting the
> >>> "Logical Types Rabbit Hole", but it would be good to start a document
> >>> collecting requirements for various logical types (e.g. timestamps) so
> >>> that we can attempt to achieve good solutions on the first try based
> >>> on the experiences (good and bad) of other projects.
> >>>
> >>> In the IPC flatbuffers metadata spec that we drafted for discussion /
> >>> prototype implementation earlier this year [1], we do have a Timestamp
> >>> logical type containing only a timezone optional field [2]. If you
> >>> contrast this with Feather (which uses Arrow's physical memory layout,
> >>> but custom metadata to suit Python/R needs), that has both a unit and
> >>> timezone [3].
> >>>
> >>> Since there is little consensus in the units of timestamps (more
> >>> consensus around the UNIX 1970-01-01 epoch, but not even 100%
> >>> uniformity), I believe the best route would be to add a unit to the
> >>> metadata to indicate second through nanosecond resolution. Same goes
> >>> for a Time type.
> >>>
> >>> For example, Parquet has both milliseconds and microseconds (in
> >>> Parquet 2.0). But earlier versions of Parquet don't have this at all
> >>> [4]. Other systems like Hive and Impala are relying on their own table
> >>> metadata to convert back and forth (e.g. embedding timestamps of
> >>> whatever resolution in int64 or int96).
> >>>
> >>> For Python pandas users who want to use Parquet files (via Arrow) in their
> >>> workflow, we're stuck with a couple options:
> >>>
> >>> 1) Drop sub-microsecond nanos and store timestamps as TIMESTAMP_MICROS
> >>> (or MILLIS? Not all Parquet readers may be aware of the new
> >>> microsecond ConvertedType)
> >>> 2) Store nanosecond timestamps as INT64 and add a bespoke entry to
> >>> ColumnMetaData::key_value_metadata (it's better than nothing?).
> >>>
> >>> I see use cases for both of these -- for Option 1, you may care about
> >>> interoperability with another system that uses Parquet. For Option 2,
> >>> you may care about preserving the fidelity of your pandas data.
> >>> Realistically, #1 seems like the best default option. It makes sense
> >>> to offer #2 as an option.
> >>>
> >>> I don't think addressing time zones in the first pass is strictly
> >>> necessary, but as long as we store timestamps as UTC, we can also put
> >>> the time zone in the KeyValue metadata.
> >>>
> >>> I'm not sure about the Interval type -- let's create a JIRA and tackle
> >>> that in a separate discussion. I agree that it merits inclusion as a
> >>> logical type, but I'm not sure what storage representation makes the
> >>> most sense (e.g. it is not clear to me why Parquet does not store the
> >>> interval as an absolute number of milliseconds; perhaps to accommodate
> >>> month-based intervals which may have different absolute lengths
> >>> depending on where you start).
> >>>
> >>> Let me know what you think, and if others have thoughts I'd be
> >>> interested too.
> >>>
> >>> thanks,
> >>> Wes
> >>>
> >>> [1]: https://github.com/apache/arrow/blob/master/format/Message.fbs
> >>> [2]: https://github.com/apache/arrow/blob/master/format/Message.fbs#L51
> >>> [3]: https://github.com/wesm/feather/blob/master/cpp/src/feather/metadata.fbs#L78
> >>> [4]: https://github.com/apache/parquet-format/blob/parquet-format-2.0.0/src/thrift/parquet.thrift
> >>>
> >>> On Tue, Jun 21, 2016 at 1:40 PM, Uwe Korn <uw...@xhochy.com> wrote:
> >>>> Hello,
> >>>>
> >>>> in addition to categoricals, we also miss at the moment a conversion
> >>>> from Timestamps in Pandas/NumPy to Arrow. Currently we only have two
> >>>> (exact) resolutions for them: DATE for days and TIMESTAMP for
> >>>> milliseconds. As
> >>>> https://docs.scipy.org/doc/numpy/reference/arrays.datetime.html notes
> >>>> there are several more. We do not need to cater for all but at least
> >>>> some of them. Therefore I have the following questions which I like to
> >>>> have solved in some form before implementing:
> >>>>
> >>>> * Do we want to cater for other resolutions?
> >>>> * If we do not provide, e.g. nanosecond resolution (sadly the default
> >>>>   in Pandas), do we cast with precision loss to the nearest match? Or
> >>>>   should we force the user to do it?
> >>>> * Not so important for me at the moment: Do we want to support time
> >>>>   zones?
> >>>>
> >>>> My current objective is to have them for Parquet file writing. Sadly
> >>>> this has the same limitations. So the two main options seem to be
> >>>>
> >>>> * "roundtrip will only yield correct timezone and logical type if we
> >>>>   read with Arrow/Pandas again (as we use "proprietary" metadata to
> >>>>   encode it)"
> >>>> * "we restrict us to milliseconds and days as resolution" (for the
> >>>>   latter option, we need to decide how graceful we want to be in the
> >>>>   Pandas<->Arrow conversion).
> >>>>
> >>>> Further datatype we have not yet in Arrow but partly in Parquet is
> >>>> timedelta (or INTERVAL in Parquet). Probably we need to add another
> >>>> logical type to Arrow to implement them. Open for suggestions here, too.
> >>>>
> >>>> Also in the Arrow spec there is TIME which seems to be the same as
> >>>> TIMESTAMP (as far as the comments in the C++ code goes). Is there maybe
> >>>> some distinction I'm missing?
> >>>>
> >>>> Cheers
> >>>>
> >>>> Uwe
> >>>>
> >>
>

Re: Timestamps with different precision / Timedeltas

Posted by Wes McKinney <we...@gmail.com>.

As one data point, none of the systems I work with use decimals for
representing timestamps (UNIX timestamps at some resolution, second /
milli / nano, is the most common), so having decimal as the default
storage class would cause a computational hardship. We may consider
incorporating the Timestamp storage type into the canonical metadata.
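
To illustrate the kind of representation meant here, a short Python sketch of an integer UNIX timestamp tagged with a resolution. TimeUnit and to_datetime are hypothetical names for illustration, not an existing Arrow API, and the conversion targets Python's datetime, which only carries microseconds.

```python
from datetime import datetime, timedelta, timezone
from enum import Enum


class TimeUnit(Enum):
    """Hypothetical unit tag; the value is the number of ticks per second."""
    SECOND = 1
    MILLI = 10**3
    MICRO = 10**6
    NANO = 10**9


EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_datetime(value: int, unit: TimeUnit) -> datetime:
    """Interpret an integer UNIX timestamp; sub-microsecond digits are dropped."""
    micros = value * 10**6 // unit.value
    return EPOCH + timedelta(microseconds=micros)
```

The stored integer passes through unchanged between systems; only the unit tag decides how it is read, e.g. to_datetime(1467760860, TimeUnit.SECOND).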

- Wes

On Tue, Jul 5, 2016 at 4:21 PM, Wes McKinney <we...@gmail.com> wrote:
> Is it worth doing a review of different file formats and database
> systems to decide on a timestamp implementation (int64 or int96 with
> some resolution seems to be quite popular as well)? At least in the
> Arrow C++ codebase, we need to add decimal handling logic anyway.
>
> On Mon, Jun 27, 2016 at 5:20 PM, Julian Hyde <jh...@apache.org> wrote:
>> SQL allows timestamps to be stored with any precision (i.e. number of digits after the decimal point) between 0 and 9. That strongly indicates to me that the right implementation of timestamps is as (fixed point) decimal values.
>>
>> Then devote your efforts to getting the decimal type working correctly.
>>
>>
>>> On Jun 27, 2016, at 3:16 PM, Wes McKinney <we...@gmail.com> wrote:
>>>
>>> hi Uwe,
>>>
>>> Thanks for bringing this up. So far we've largely been skirting the
>>> "Logical Types Rabbit Hole", but it would be good to start a document
>>> collecting requirements for various logical types (e.g. timestamps) so
>>> that we can attempt to achieve good solutions on the first try based
>>> on the experiences (good and bad) of other projects.
>>>
>>> In the IPC flatbuffers metadata spec that we drafted for discussion /
>>> prototype implementation earlier this year [1], we do have a Timestamp
>>> logical type containing only a timezone optional field [2]. If you
>>> contrast this with Feather (which uses Arrow's physical memory layout,
>>> but custom metadata to suit Python/R needs), that has both a unit and
>>> timezone [3].
>>>
>>> Since there is little consensus in the units of timestamps (more
>>> consensus around the UNIX 1970-01-01 epoch, but not even 100%
>>> uniformity), I believe the best route would be to add a unit to the
>>> metadata to indicate second through nanosecond resolution. Same goes
>>> for a Time type.
>>>
>>> For example, Parquet has both milliseconds and microseconds (in
>>> Parquet 2.0). But earlier versions of Parquet don't have this at all
>>> [4]. Other systems like Hive and Impala are relying on their own table
>>> metadata to convert back and forth (e.g. embedding timestamps of
>>> whatever resolution in int64 or int96).
>>>
>>> For Python pandas users that want to use Parquet files (via Arrow) in their
>>> workflow, we're stuck with a couple options:
>>>
>>> 1) Drop sub-microsecond nanos and store timestamps as TIMESTAMP_MICROS
>>> (or MILLIS? Not all Parquet readers may be aware of the new
>>> microsecond ConvertedType)
>>> 2) Store nanosecond timestamps as INT64 and add a bespoke entry to
>>> ColumnMetaData::key_value_metadata (it's better than nothing?).
>>>
>>> I see use cases for both of these -- for Option 1, you may care about
>>> interoperability with another system that uses Parquet. For Option 2,
>>> you may care about preserving the fidelity of your pandas data.
>>> Realistically, #1 seems like the best default option. It makes sense
>>> to offer #2 as an option.
>>>
>>> I don't think addressing time zones in the first pass is strictly
>>> necessary, but as long as we store timestamps as UTC, we can also put
>>> the time zone in the KeyValue metadata.
>>>
>>> I'm not sure about the Interval type -- let's create a JIRA and tackle
>>> that in a separate discussion. I agree that it merits inclusion as a
>>> logical type, but I'm not sure what storage representation makes the
>>> most sense (e.g. it is not clear to me why Parquet does not store the
>>> interval as an absolute number of milliseconds; perhaps to accommodate
>>> month-based intervals which may have different absolute lengths
>>> depending on where you start).
>>>
>>> Let me know what you think, and if others have thoughts I'd be interested too.
>>>
>>> thanks,
>>> Wes
>>>
>>> [1]: https://github.com/apache/arrow/blob/master/format/Message.fbs
>>> [2]: https://github.com/apache/arrow/blob/master/format/Message.fbs#L51
>>> [3]: https://github.com/wesm/feather/blob/master/cpp/src/feather/metadata.fbs#L78
>>> [4]: https://github.com/apache/parquet-format/blob/parquet-format-2.0.0/src/thrift/parquet.thrift
>>>
>>> On Tue, Jun 21, 2016 at 1:40 PM, Uwe Korn <uw...@xhochy.com> wrote:
>>>> Hello,
>>>>
>>>> in addition to categoricals, we are also currently missing a conversion from
>>>> Timestamps in Pandas/NumPy to Arrow. Currently we only have two (exact)
>>>> resolutions for them: DATE for days and TIMESTAMP for milliseconds. As
>>>> https://docs.scipy.org/doc/numpy/reference/arrays.datetime.html notes there
>>>> are several more. We do not need to cater for all but at least some of them.
>>>> Therefore I have the following questions which I would like to have resolved in some
>>>> form before implementing:
>>>>
>>>> * Do we want to cater for other resolutions?
>>>> * If we do not provide, e.g. nanosecond resolution (sadly the default
>>>>   in Pandas), do we cast with precision loss to the nearest match? Or
>>>>   should we force the user to do it?
>>>> * Not so important for me at the moment: Do we want to support time zones?
>>>>
>>>> My current objective is to have them for Parquet file writing. Sadly this
>>>> has the same limitations. So the two main options seem to be
>>>>
>>>> * "roundtrip will only yield correct timezone and logical type if we
>>>>   read with Arrow/Pandas again (as we use "proprietary" metadata to
>>>>   encode it)"
>>>> * "we restrict ourselves to milliseconds and days as resolution" (for the
>>>>   latter option, we need to decide how graceful we want to be in the
>>>>   Pandas<->Arrow conversion).
>>>>
>>>> A further datatype that we do not yet have in Arrow but partly in Parquet is timedelta
>>>> (or INTERVAL in Parquet). Probably we need to add another logical type to
>>>> Arrow to implement them. Open for suggestions here, too.
>>>>
>>>> Also in the Arrow spec there is TIME which seems to be the same as TIMESTAMP
>>>> (as far as the comments in the C++ code go). Is there maybe some
>>>> distinction I'm missing?
>>>>
>>>> Cheers
>>>>
>>>> Uwe
>>>>
>>