
Posted to dev@parquet.apache.org by Ryan Blue <bl...@cloudera.com> on 2014/06/12 04:01:45 UTC

Date and Time logical types

I've been looking at the proposed date and time logical types and I have 
a few questions. Here are the proposed logical types:

Date and Time:
* date: truncated Julian day (int32)
* time_milli: milliseconds since midnight (int32)
* time_micro: microseconds since midnight (int64)
* interval: proposed 12 bytes (Nong to review)

Timestamps: always stored as epoch time (units since UTC Jan 1, 1970).
Can also be annotated with an ISO time zone string in the footer.
* timestamp_milli: milliseconds (int64)
* timestamp_micro: microseconds (int64)

Could we use the same epoch for both date and the timestamps? I think it 
will get confusing for implementations if we use the Julian epoch for 
dates and the Unix epoch for timestamps. I like that the proposed 
timestamp_milli is familiar to most Java developers because both 
java.util.Date and Joda's Instant are backed by it. Will Unix epoch work 
for date?

Why is the maximum precision in microseconds? Both previous proposals 
used nanoseconds instead. The gain seems to be that timestamp_micro fits 
in an int64, but that means that the time_micro type is only using 5 
bits of the extra 4 bytes used to store it.
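
For reference, the "5 bits" figure can be checked with a few lines of 
arithmetic. This is only an illustrative sketch; the constant and variable 
names are made up here and are not part of the proposal.

```python
# Illustrative arithmetic only; these names are hypothetical.
MICROS_PER_DAY = 24 * 60 * 60 * 1_000_000  # 86,400,000,000 microseconds in a day

# Bits needed to hold the largest microseconds-since-midnight value.
bits_needed = (MICROS_PER_DAY - 1).bit_length()

# Bits that spill past the first 4 bytes of the int64.
bits_in_upper_half = bits_needed - 32

print(bits_needed, bits_in_upper_half)  # prints: 37 5
```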

One solution I'd like to consider is what Apache Phoenix does. Phoenix 
uses a separate 4 bytes to store a nanosecond offset (20 bits). This 
would enable ignoring the nanoseconds in some cases, like for most 
comparisons in filters. It would take no more space than the time_micro 
type and would require another 4 bytes for the timestamp equivalent, but 
you'd get nanosecond precision.
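
A rough sketch of the split described above, assuming the offset counts 
nanoseconds within the current millisecond (which is what makes it fit in 20 
bits). The function names are hypothetical and this is not Phoenix's actual 
code.

```python
# Hypothetical helpers; not taken from Phoenix or Parquet.
NANOS_PER_MILLI = 1_000_000

def split_nanos(total_nanos):
    """Return (millis, nano_offset) with 0 <= nano_offset < NANOS_PER_MILLI."""
    # divmod floors, so the offset stays non-negative even for negative inputs.
    return divmod(total_nanos, NANOS_PER_MILLI)

def join_nanos(millis, nano_offset):
    """Rebuild the full nanosecond count from the two parts."""
    return millis * NANOS_PER_MILLI + nano_offset

def before_ignoring_nanos(a, b):
    """Compare two (millis, nano_offset) pairs on the millisecond part only."""
    return a[0] < b[0]

# Bits needed for the largest possible offset (999,999).
OFFSET_BITS = (NANOS_PER_MILLI - 1).bit_length()
```

A filter that only needs millisecond resolution can call 
before_ignoring_nanos and never look at the second field.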

Another win for the 4-byte nanosecond offset is that we are more likely 
to be able to use the same representation in the HBase type spec, since 
Phoenix already has data using this type and will almost certainly 
require nanosecond precision.

Thanks for putting together this proposal, it's great to see how close 
this is getting to finished.

rb

-- 
Ryan Blue
Software Engineer
Cloudera, Inc.

Re: Date and Time logical types

Posted by Ryan Blue <bl...@cloudera.com>.

On 07/10/2014 01:26 PM, Nong Li wrote:
> On Wed, Jun 11, 2014 at 7:25 PM, Jacques Nadeau <jacques@apache.org>
> wrote:
>
>     As far as truncated Julian day versus Unix epoch, I thought they
>     started at the same time, which is why I suggested it.  Upon further
>     looking, I realize they do not.  As such, I guess the best option is
>     the Unix epoch time in seconds divided by 86400.
>
> I don't think anyone feels too strongly about which one we pick, but I
> agree we should pick the same for both.
> Ryan & Jacques: you guys seem to have a stronger opinion on this.
> Jacques, I can't tell from your previous email
> if we've got consensus now.

The pull request, #3, uses Unix epoch now. I think we have consensus.

>         Why is the maximum precision in microseconds? Both previous
>         proposals used nanoseconds instead. The gain seems to be that
>         timestamp_micro fits in an int64, but that means that the
>         time_micro type is only using 5 bits of the extra 4 bytes used
>         to store it.
>
>         One solution I'd like to consider is what Apache Phoenix does.
>         Phoenix uses a separate 4 bytes to store a nanosecond offset (20
>         bits). This would enable ignoring the nanoseconds in some cases,
>         like for most comparisons in filters. It would take no more
>         space than the time_micro type and would require another 4 bytes
>         for the timestamp equivalent, but you'd get nanosecond precision.
>
> How do you propose adding those 4 bytes? I don't want to introduce
> "compound single column types".
> What if we added time_nano and timestamp_nano and used
> fixed_len_byte_array as the underlying
> storage type?

I'm proposing we replace the _micro types with _nano types. time_nano 
will fit in an int64, but timestamp_nano will not. I propose we store 
timestamp_nano as a 12-byte fixed-length value, with the first 8 bytes 
encoding the time in milliseconds and the remaining 4 storing the 
nanosecond offset. Both values should be big-endian.
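
A minimal sketch of the 12-byte layout, using Python's struct module. It 
assumes the millisecond part is a signed 64-bit integer and the offset an 
unsigned 32-bit integer; the message above does not specify signedness, and 
the function names are hypothetical.

```python
import struct

NANOS_PER_MILLI = 1_000_000

# ">" = big-endian with no padding, "q" = signed 8 bytes, "I" = unsigned 4 bytes.
# Signedness is an assumption; the proposal only fixes the sizes and byte order.
_LAYOUT = struct.Struct(">qI")

def encode_timestamp_nano(millis, nano_offset):
    """Pack (millis since epoch, nanos within that milli) into 12 bytes."""
    if not 0 <= nano_offset < NANOS_PER_MILLI:
        raise ValueError("nano_offset must be in [0, 1_000_000)")
    return _LAYOUT.pack(millis, nano_offset)

def decode_timestamp_nano(buf):
    """Unpack 12 bytes back into a (millis, nano_offset) tuple."""
    return _LAYOUT.unpack(buf)
```

A reader that only needs millisecond precision could decode just the first 8 
bytes and skip the rest.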

rb


Re: Date and Time logical types

Posted by Jacques Nadeau <ja...@apache.org>.

As far as truncated Julian day versus Unix epoch, I thought they started at
the same time, which is why I suggested it.  Upon further looking, I realize
they do not.  As such, I guess the best option is the Unix epoch time in
seconds divided by 86400.
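
A small sketch of that conversion, with hypothetical helper names. The 
truncated Julian day epoch of 1968-05-24 is an assumption based on the NASA 
definition, included only to show that the two epochs differ.

```python
from datetime import date

SECONDS_PER_DAY = 86400
UNIX_EPOCH = date(1970, 1, 1)
# Assumption: NASA's truncated Julian day counts from 1968-05-24.
TJD_EPOCH = date(1968, 5, 24)

def days_from_epoch_seconds(epoch_seconds):
    """Days since 1970-01-01; floor division keeps pre-1970 values consistent."""
    return epoch_seconds // SECONDS_PER_DAY

def days_from_date(d):
    """Days since 1970-01-01 for a calendar date."""
    return (d - UNIX_EPOCH).days

def date_from_days(days):
    """Inverse of days_from_date."""
    return date.fromordinal(UNIX_EPOCH.toordinal() + days)

# Gap between the two epochs, in days.
EPOCH_GAP_DAYS = (UNIX_EPOCH - TJD_EPOCH).days
```

With these helpers, days_from_date(date(2014, 6, 12)) gives 16233, which fits 
comfortably in an int32.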

